A comparison of novel and traditional numerical methods for the analysis of modern pollen assemblages from major vegetation–landform types

A comparison of novel and traditional numerical methods for the analysis of modern pollen assemblages from major vegetation–landform types

Review of Palaeobotany and Palynology 210 (2014) 22–36 Contents lists available at ScienceDirect Review of Palaeobotany and Palynology journal homep...

2MB Sizes 0 Downloads 155 Views

Review of Palaeobotany and Palynology 210 (2014) 22–36

Contents lists available at ScienceDirect

Review of Palaeobotany and Palynology journal homepage: www.elsevier.com/locate/revpalbo

A comparison of novel and traditional numerical methods for the analysis of modern pollen assemblages from major vegetation–landform types V.A. Felde a,b,⁎, A.E. Bjune a, J.-A. Grytnes b, H.J.B. Birks b,c,d,e,⁎⁎ a

Bjerknes Centre for Climate Research, Uni Research Climate, Allégaten 55, N-5007 Bergen, Norway Department of Biology, University of Bergen, PO Box 7803, N-5020 Bergen, Norway Bjerknes Centre for Climate Research, Allégaten 55, N-5007 Bergen, Norway d Environmental Change Research Centre, University College London, Gower Street, London WC1E 6BT, UK e School of Geography and the Environment, University of Oxford, South Parks Road, Oxford OX1 3QY, UK b c

a r t i c l e

i n f o

Article history: Received 26 November 2013 Received in revised form 19 June 2014 Accepted 29 June 2014 Available online 9 July 2014 Keywords: Central Canada Modern pollen data Vegetation–landform type Numerical methods Supervised learning mode Unsupervised learning mode

a b s t r a c t In recent decades there has been a major development of numerical and statistical methods with considerable potential for the handling, analysis, and summarisation of complex multivariate pollen data. We compare ‘traditional’ ordination and hierarchical clustering methods with recently-developed methods such as principal curves, non-hierarchical agglomerative clustering, multivariate classification trees, and random forests by investigating modern pollen–vegetation relationships using the well-known surface pollen data-set of Lichti– Federovich and Ritchie (1968) from the Western Interior of Canada. Our results show that the modern pollen data display well the major vegetation–landform types from which the surface samples were collected independent of which numerical method is used. Ordination methods partly capture the geographical position of the lakes from which the samples were collected. Different ordination methods produce similar results. Principal curves are not superior to the results from conventional ordination methods. Ward's hierarchical clustering method performs as well as other agglomerative clustering techniques. Multivariate classification trees and random forests produce similar results, but the latter has smaller prediction errors. In general, the novel methods do not provide any new insights into the data, nor do they have any impact on previous conclusions. However, these novel methods have several advantages over traditional techniques because they make fewer assumptions about the underlying character or structure of the data, and they include effective ways of displaying the quantitative importance of different variables. As it is not possible to recommend one method over another, our conclusion is that there is now a large number of robust numerical methods that can be used to detect the underlying patterns within modern pollen assemblages. © 2014 Elsevier B.V. All rights reserved.

1. Introduction Pollen analysis is one of the most important tools for reconstructing past floristic and vegetational changes over time scales from hundreds to thousands of years. Acquiring a detailed understanding of how pollen composition reflects the surrounding vegetation in both qualitative and quantitative terms has received much attention (see Davis, 2000 for a review). Such an understanding provides a factual basis for reconstructing how vegetation has changed through time. This understanding is particularly critical because the relationship between a

⁎ Correspondence to: V.A. Felde, Uni Research Climate, Allégaten 55, N-5007 Bergen, Norway. Tel.: +47 5558 8127 (office). ⁎⁎ Correspondence to: H.J.B. Birks, Department of Biology, University of Bergen, PO Box 7803, N-5020 Bergen, Norway. Tel.: +47 5558 3350 (office). E-mail addresses: [email protected] (V.A. Felde), [email protected] (A.E. Bjune), [email protected] (J.-A. Grytnes), [email protected] (H.J.B. Birks).

http://dx.doi.org/10.1016/j.revpalbo.2014.06.003 0034-6667/© 2014 Elsevier B.V. All rights reserved.

pollen assemblage and the surrounding vegetation is not simple as plants have different pollen production and dispersal characteristics (e.g. Davis, 1963, 2000; Andersen, 1970). Investigating how pollen reflects the surrounding vegetation is not a trivial task because both pollen and vegetation data are usually highly multivariate and complex. The development of numerical methods for analysing such data in the last 40 years has therefore been important for providing repeatable and consistent means for detecting, summarising, and displaying variation and similarities in modern pollen composition and for assessing how well modern pollen assemblages reflect contemporary vegetation (Birks and Gordon, 1985; Birks, 2013). There are several ways of studying modern pollen–vegetation relationships (Birks and Birks, 1980; Birks and Gordon, 1985). Many focus on differences in pollen–plant representation and develop ways of modelling and quantifying the unequal abundances between individual taxa present in both the surrounding vegetation and the pollen assemblage (e.g. Davis, 1963, 2000; Andersen, 1970, 1978; Sugita, 1994,

V.A. Felde et al. / Review of Palaeobotany and Palynology 210 (2014) 22–36

2007a,b). Others focus on assessing how well a modern pollen assemblage as a whole reflects the surrounding vegetation in terms of investigating the relationships between pollen assemblages and the vegetation or vegetation–landform types within which the palynological sites and their surrounding vegetation belong (e.g. Birks et al., 1975). A traditional way of comparing modern pollen composition and vegetation types quantitatively has involved multivariate techniques such as canonical variates analysis, also known as multiple discriminant analysis, to detect how modern pollen assemblages reflect the surrounding vegetation types given that the assignment of the pollen samples to the contemporary vegetation types is known a priori (e.g. Birks et al., 1975; Birks and Gordon, 1985). These analyses provide information about how well modern pollen assemblages are able to discriminate between vegetation types today. When a priori knowledge about the surrounding vegetation types is lacking or the aim is to investigate the underlying inherent structure within the pollen data, hierarchical clustering and ordination techniques have been used to investigate patterns in pollen composition based solely on the pollen assemblages (e.g. Birks et al., 1975; Birks and Gordon, 1985). The vegetation groups can then be identified by investigating the clusters or structure in the data visually. After establishing the numbers of clusters in the pollen data, the composition of these pollen-based clusters can be compared with the composition of the known vegetation groups (e.g. Birks et al., 1975; Birks and Gordon, 1985). These two general approaches, both used by Birks et al. (1975), represent two alternative ways of investigating data structure (Hastie et al., 2011a; James et al., 2014). The first approach uses external variables (e.g. known vegetation types) to guide the analysis and to predict external features from additional independent data. It is thus a supervised learning technique (Simpson and Birks, 2012). The other approach with no external explanatory variables or constraints such as a priori information about vegetation type is an unsupervised technique. Here the aim is not to predict but to detect and describe the basic underlying structure in the data (Simpson and Birks, 2012). During recent decades several new ways of analysing multivariate data within these two broad approaches have been developed but to date few of these methods have been used by palaeoecologists (Birks et al., 2012). New and more robust numerical methods are being continually developed, facilitated by improved computer technology and easily accessible free software to do the computational analyses (R Core Team, 2013). Books on numerical methods in palaeoecology thus soon become outdated (e.g. Birks and Gordon, 1985), and R packages (e.g. Juggins, 2013; Oksanen et al., 2013) and a flow of new books (e.g. Borcard et al., 2011; Wehrens, 2011; Birks et al., 2012; Legendre and Legendre, 2012; Crawley, 2013; Lantz, 2013; James et al., 2014) help to inform palaeoecologists about what possibilities exist today for robust data analysis of complex multivariate data-sets. Examples of new methods within the supervised approach (Simpson and Birks, 2012) are multivariate regression trees, classification trees, and random forests that can be used to assess how well pollen composition predicts vegetation types. Pelánková et al. (2008) used multivariate regression trees to investigate the relationship of modern pollen composition to vegetation and climate in the Siberian steppeforest–tundra transition, and Herzschuh and Birks (2010) used this approach to find indicator values of pollen and spore taxa in relation to modern vegetation and climate on the Tibet–Qinghai Plateau. These and related problems can also be explored using random forests (Simpson and Birks, 2012; James et al., 2014). Random forests are generally more robust than multivariate regression or classification trees because they make fewer underlying assumptions and are based on bootstrapping techniques (Breiman, 2001). Many different divisive clustering techniques based on the least sum-of-squares principle have been developed for investigating the structure of large data-sets (Legendre and Birks, 2012a; Legendre and Legendre, 2012) and are examples of unsupervised methods (Simpson and Birks, 2012; James et al., 2014). These include K-means partitioning, spherical K-means

23

clustering, fuzzy clustering, and partitioning around medoids, none of which appear to have been used in pollen analysis. For these methods the number of groups must be specified a priori. Along with developments in clustering methods, several numerical criteria have been developed to investigate the basic structure in the data and to delimit a robust number of discrete groups (Milligan and Cooper, 1985; Tibshirani et al., 2001; Mirkin, 2011). This has also led to the development of indices that can be used to assess the performance of clustering by comparing the resulting clusters with an a priori grouping. This also allows quantitative comparisons between results from different cluster methods in relation to each other. A robust index for such comparisons is the adjusted or modified Rand Index which accounts for inherent randomness (Hubert and Arabie, 1985; Gordon, 1999; Legendre and Birks, 2012a). Ordination methods have been widely used in pollen analysis during the last fifty years and they have become standard tools for detecting structure in palaeoecological data and for the low-dimensional visualisation of such data (Birks and Gordon, 1985; Legendre and Birks, 2012b; Legendre and Legendre, 2012; Birks, 2013). The most appropriate ordination method to use in vegetation science has been extensively discussed in the ecological literature during recent decades and contrasting opinions exist (e.g. ter Braak, 1986; ter Braak and Verdonschot, 1995; Legendre and Birks, 2012a; Legendre and Legendre, 2012; ter Braak and Šmilauer, 2012; Austin, 2013; Oksanen et al., 2013; Šmilauer and Lepš, 2014). This is because different ordination methods fulfil different criteria and have contrasting underlying assumptions. However, others have suggested that the choice of ordination method applied in exploratory data analysis may not have a great influence on the results if there is a strong underlying gradient, as is often the case when investigating modern pollen composition in relation to vegetation types (Prentice, 1980). By using Procrustes and Protest analyses, developed to compare two ordinations of the same set of objects and to test the statistical significance of the correlation between ordinations, a quantitative comparison of results from different ordination methods is now possible (Digby and Kempton, 1987; Jackson, 1995; Birks, 2012; Legendre and Birks, 2012b). The major aim of this paper is to assess the results from applying several of these novel numerical methods to compare modern pollen assemblages in relation to major vegetation types. It is expected that the numerical methods are robust, and as long as the palynological signal is clear the choice of method will not greatly affect the results. The results from five unsupervised ordination methods (principal component analysis, correspondence analysis, detrended correspondence analysis, principal coordinate analysis, non-metric multidimensional scaling) are compared using Procrustes analysis and the statistical significance of the Procrustes correlation is assessed by the Protest randomisation test (Peres-Neto and Jackson, 2001; Birks, 2012). Comparison is also made between the different ordination results and the geographical coordinates of the lakes from which the modern pollen assemblages were collected to assess if the palynological patterns reflect the underlying geographical structure of the data. Principal curves (De'ath, 1999) are also used to summarise variation in the first ordination axis. The supervised techniques of multivariate classification trees, random forests, and constrained ordinations are used to establish how well pollen composition predicts major vegetation types. Unsupervised clustering techniques such as hierarchical clustering using Ward's method, K-means, spherical K-means, TWINSPAN, fuzzy clustering, and partitioning around medoids are also conducted using both a known and an unknown number of groups a priori. Finally, the similarities of the resulting clusters and the actual a priori vegetation types are assessed using the adjusted Rand Index. In achieving our major aim, we introduce a suite of new numerical tools into palynology that have not, as far as we know, been used in the analysis of pollen-analytical data. These include Procrustes and Protest analyses, spherical K-means, fuzzy clustering, partitioning around medoids, multivariate classification trees, and principal curves.

24

V.A. Felde et al. / Review of Palaeobotany and Palynology 210 (2014) 22–36

V.A. Felde et al. / Review of Palaeobotany and Palynology 210 (2014) 22–36

Borcard et al. (2011), Legendre and Legendre (2012), Birks et al. (2012), and James et al. (2014) provide clear introductions to these and other novel techniques for multivariate data analysis. 2. Data and numerical methods 2.1. Data The data used for the numerical analyses are modern pollen assemblages from surface-sediment samples from 131 lakes in Nunavut, Manitoba, and Saskatchewan, in the Western Interior of Canada, analysed by Lichti-Federovich and Ritchie (1968). These data are here termed the ‘central Canadian data’ (Birks et al., 1975). The samples were collected across a broad vegetation gradient from the arctic tundra and forest– tundra in the north, through open and closed boreal forest, to deciduous forest, parkland, and grassland in the south (Fig. 1). The eight “broadly uniform physiognomic” vegetation zones in the Western Interior of Canada (Lichti-Federovich and Ritchie, 1968) closely correlate spatially with several meso-scale climatic parameters (Bryson, 1966). However, superimposed on the broad north–south vegetation zonation are variations in the zonal vegetation related to, for example, topography, soil cover, and glacial, alluvial, or marine deposits (Lichti-Federovich and Ritchie, 1968). Lichti-Federovich and Ritchie (1968) thus delimited a series of vegetation–landform types (e.g. forest–tundra on uplands, forest–tundra on lowlands) within the major vegetation zones (e.g. forest–tundra). These vegetation–landform types are mappable units of distinctive landform and vegetation (Ritchie, 1962) and range in area from 1230 ha to 135,000 ha in their study area (Lichti-Federovich and Ritchie, 1968). Janssen (1973, 1981) explores the relevance of vegetation–landform types at a range of spatial scales in understanding modern pollen assemblages and the consistency and spatial resolution of pollen spectra within and between vegetation–landform types. Lichti-Federovich and Ritchie (1968) divided the eight broad vegetation zones into thirteen vegetation–landform types based on dominant taxa, geographical location, and landform. These thirteen types are tundra (tr: samples 1–14), forest–tundra upland and lowlands (ft1: 15–22, ft2: 23–24), open coniferous forest east (ocf-e: 25–35), open coniferous forest west (ocf-w: 36–38), closed coniferous forests a, b, and c (ccf-a: 62–69, ccf-b: 39, 111–131, ccf-c: 40–61), mixed forest upland and lowlands (mf1: 82–93, mf2: 70–81, 94, 95), deciduous forest (df: 105–108), aspen parkland (ap: 96–104), and grassland (gr: 109, 110) (see Fig. 1). Lichti-Federovich and Ritchie (1968) discussed the similarities and differences in the modern pollen assemblages between and within the eight vegetation zones and the thirteen vegetation–landform types. Later these data were used in a pioneering study by Birks et al. (1975) who compared different numerical methods to analyse the 131 pollen assemblages based solely on their composition, thereby allowing numerical criteria to become the basis for assessing the compositional similarities between samples. They also used canonical variates analysis to explore how pollen composition reflected the vegetation–landform types that were used as a priori groupings in the analysis. In the Birks et al. (1975) analysis, samples from ft1 and ft2 were combined into one ft type, and ocf-e and ocf-w were combined into one ocf type. The methods used by Birks et al. (1975) (principal component analysis, canonical variates analysis, Ward's sum-ofsquares agglomerative clustering) were ‘state-of-the-art’ 40 years ago. Birks and Gordon (1985) used these data further to demonstrate the application of other clustering and ordination methods in handling modern pollen-assemblage data. These central Canadian data are therefore

25

well-known and are an ideal test data-set for introducing and evaluating new numerical methods. We therefore continue in this tradition and apply a range of new numerical methods to these data, and compare our results with results obtained from the methods used earlier. Pollen assemblages from similar vegetation–landform types are often dominated by the same taxon (e.g. Pinus), and these assemblages are often difficult to separate. The original 13 vegetation–landform types of Lichti-Federovich and Ritchie (1968) were therefore amalgamated into broad groupings of 11, 8, 5, and 3 vegetation–landform types representing different regional scales of variation (Birks and Gordon, 1985) (see Table A1, Appendix A). The pollen data consist of 19 terrestrial pollen taxa that have a value of greater than 2.5% in any sample in the original data-set (Birks et al., 1975; Birks and Gordon, 1985). Inclusion of the 100 + pollen and spore types identified and recorded by Lichti-Federovich and Ritchie (1968), many of which are of rare occurrence and/or low numerical representation, has little or no effect on the results of the numerical methods used by Birks et al. (1975) and Birks and Gordon (1985) (HJB Birks unpublished results). To permit comparability with these earlier numerical analyses, the same 19 pollen taxa were used in this study. Experience with other data-sets and the numerical techniques presented here similarly shows that rare or poorly represented taxa have little or no effect on the results of, for example, multivariate classification trees (Felde et al., 2014). 2.2. Numerical methods The account of the methods used is divided into methods for analysing the data in an unsupervised mode followed by those used in a supervised mode (Simpson and Birks, 2012; James et al., 2014). The unsupervised and supervised techniques represent two different ways of analysing data. However, they are complementary and are equally valuable. Some of the methods employed in this study can be used in both supervised and unsupervised modes (Hastie et al., 2011a; Simpson and Birks, 2012; James et al., 2014) (Table 1). 2.2.1. Unsupervised learning methods Five common unconstrained or classical ordination methods – principal component analysis (PCA), correspondence analysis (CA), detrended correspondence analysis (DCA), principal coordinate analysis (PCoA), and non-metric multidimensional scaling (NMDS) – were used to detect and summarise the latent structure within the modern central Canadian pollen assemblages. The pollen percentage data were square-root transformed prior to all ordination analyses (Prentice, 1980; Legendre and Birks, 2012b). PCoA and NMDS were implemented using either the Bray–Curtis dissimilarity coefficient or Euclidean distances (Legendre and Birks, 2012b). Differences between the results from these five ordination methods were investigated by a generalised Procrustes analysis to assess the similarity between all pairs of ordination results (Gower, 1975; Prentice, 1980; Digby and Kempton, 1987). A Protest randomisation test was used to assess the statistical significance of these Procrustes comparisons (Jackson, 1995; Peres-Neto and Jackson, 2001; Birks, 2012). Procrustes analysis attempts to find the best fit between two ordinations of the same objects by scaling, reflection, centring, and rotation (Gower, 1975; Digby and Kempton, 1987; Peres-Neto and Jackson, 2001). Distances between the matching pairs of objects are calculated as sum-of-squares. The Protest randomisation test determines whether the sum of the residual deviations is less than expected by chance (Jackson, 1995). Birks et al. (1975) hypothesise

Fig. 1. Map of the distribution of the surface-sediment samples from 131 lakes in Manitoba, Saskatchewan, and Nunavut, central Canada, collected by Lichti-Federovich and Ritchie (1968) and the boundaries of the major vegetation–landform types. The map is made in ArcMap 10, and is a modified version of the map in Lichti-Federovich and Ritchie (1968). It shows the location of the modern samples in relation to the vegetation–landform types. Each sample is identified by the sample number. The inset map shows the location of the study area. Abbreviations: tr = tundra; ft-1 = forest–tundra on uplands; ft-2 = forest–tundra on lowlands; ocf = open coniferous forest, e = eastern, w = western; ccf = closed coniferous forest, with three sub-zones a, b, and c; mf = mixed coniferous–broadleaved deciduous forest; df = deciduous forest; ap = aspen parkland; gr = grassland. For further details, see the Data section in the text, Table A1, Appendix A, and Lichti-Federovich and Ritchie (1968).

26

V.A. Felde et al. / Review of Palaeobotany and Palynology 210 (2014) 22–36

Table 1 Summary of the numerical techniques used in this study in an unsupervised and a supervised learning mode. Methods

Mode Unsupervised

Classical ordinations Constrained (canonical) ordinations Principal curves Hierarchical clustering K-means partitioning Partitioning around medoids Fuzzy clustering TWINSPAN Spherical K-means Multivariate classification trees Random forests

Supervised

+ + + + + + + + +

+ +

+ +

that the first major gradient in their ordinations of the central Canadian pollen data reflects the south–north palynological and underlying broad-scale vegetation zonation and that the second gradient in their ordinations reflects the east–west palynological and underlying vegetation gradient in central Canadian vegetation today. To assess if the results of the various ordination methods reflect the geographical position of the lakes within the study area, we compared all the resulting first and second ordination axes with the geographical coordinates of the 131 lakes sampled. The results of these comparisons were then summarised by constructing a similarity matrix of the Procrustes sum-of-squares and summarised in a PCoA (Digby and Kempton, 1987). Plots from the various individual ordination methods are not included in this paper as some of them have been previously published in Birks et al. (1975) and Birks and Gordon (1985). All the ordination methods were done in R (R Core Team, 2013) using the vegan package (Oksanen et al., 2013). Principal curves (PCs) (De'ath, 1999; Hastie et al., 2011a) were applied to summarise the palynological compositional turnover along the first ordination axis. This is done by fitting a smooth curve through the data in m dimensions (where m is the number of variables or the number of objects, whichever is smaller) (De'ath, 1999). The algorithm is conducted in two steps where the starting points for each observation are determined from the object scores using the first axis of an ordination method such as PCA, CA, or PCoA (De'ath, 1999; Simpson and Birks, 2012). The second step involves minimising the sum of distances between the projected points and the observed data (De'ath, 1999; Simpson and Birks, 2012). The choice of initial curve, type of smoother, and the span of the smoother used are parameters that affect the principal curves (De'ath, 1999). Different starting points using PCA, CA, PCoA, and NMDS were compared, and the best-fitted curve was determined by smoothing splines using generalised cross-validation to determine the degree of smoothness. The pollen values were square-root transformed prior to analysis. The analyses were done in R using the pcurve package (Hastie et al., 2011b). Clustering and partitioning methods can be regarded as complementary to ordination, but they are more specialised in that they are designed to place objects or variables into groups a posteriori (Prentice, 1980; Legendre and Birks, 2012a; Legendre and Legendre, 2012). For pollen-analytical data, the most common approach is to use agglomerative hierarchical cluster methods (Birks and Gordon, 1985). More advanced divisive methods such as two-way indicator species analysis (TWINSPAN) are used occasionally. The clustering methods used here are regular K-means partitioning (KM), spherical K-means clustering (SKM), TWINSPAN, partitioning around medioids (PAM), and fuzzy C-means clustering. KM and PAM are methods developed for investigating clusters in large data-sets based on similarities or dissimilarities between objects (Borcard et al., 2011). KM uses the local structure of the data to identify high-density regions in the data and

clusters are determined by minimising sum-of-squared Euclidean distances (Borcard et al., 2011). PAM is closely related to KM, but minimises the sum of dissimilarities instead of using the least sum-ofsquares principle (Borcard et al., 2011). Fuzzy clustering is different in the way it estimates the membership coefficient of clusters. In this case the resulting ‘hard’ clusters are defined by the highest membership coefficient from each object (Gordon, 1999). TWINSPAN provides a biclustering of both objects and variables and was originally developed to analyse large and heterogeneous species-site matrices in plant ecology (Hill and Šmilauer, 2005; Fielding, 2007). What is different about this method is that it uses conjoint coding and the concept of pseudospecies to transform the data. This means that one variable may be represented several times but with different numerical levels (Hill and Šmilauer, 2005). The pseudo-species cuts need to be specified a priori, and were set here to 0, 2, 4, 8, 16, 32, and 64. The program WinTWINS version 2.3 (Hill and Šmilauer, 2005) was used. Another version of biclustering is spherical K-means clustering (SKM). This method has been used in data mining, and recently Hill et al. (2013) and Preston et al. (2013) introduced it into ecology and plant geography. The algorithm of SKM clustering provides variable and object clusters at the same time, and is based on minimising the sum of squared chord or perpendicular distances (Hill et al., 2013). Both numbers of objectgroupings and variable-groupings are parameters that need to be specified in advance. The SKM clustering was done using the SPHERIKM program (Hill, 2012). We used perpendicular distances because this works well with biogeographical data (Hill et al., 2013), and a standard weighting of 1 for distances. Taxon variables were clustered first followed by the objects. This is referred to as R-mode clustering and can identify potential patterns in the taxon data that cannot be detected if taxa are clustered based on a clustering of the objects. This approach and the parameters mentioned above can be changed in SPHERIKM (Hill, 2012; Hill et al., 2013). A major challenge in cluster analysis is that very often the optimal number of clusters is not known a priori (Tibshirani et al., 2001). In all our cluster analyses we assumed a specific number of underlying vegetation–landform types. Because it can be difficult to separate certain vegetation–landform types, the numbers of vegetation–landform groups (k) were set to 11, 8, 5, and 3 (see Table 1, Appendix A). An alternative approach can be used where the number of vegetation–landform types is not known. The numbers of potential groups can then be determined by using a gap-statistic for KM and PAM (Tibshirani et al., 2001), a subjective assessment of the hierarchical clustering results, a TWINDEND analysis of the TWINSPAN results (TWINDEND compares TWINSPAN within-group sum-of-squares of clusters; HJB Birks and JM Line unpublished program), and Akaike information criteria (AICs) for SKM (Hill et al., 2013). KM, SKM, PAM, and fuzzy clustering were done on squareroot transformed pollen percentage data, while TWINSPAN and TWINDEND were conducted on percentage data transformed by conjoint coding with seven pseudo-species cut-levels (see above). To compare quantitatively the differences between the cluster groups based on the pollen assemblages and the vegetation–landform types from which the objects were collected, the adjusted Rand Index was used (Hubert and Arabie, 1985; Gordon, 1999). This index assesses the performance of the different clustering methods compared to random expectation, and gives a value between −1 and +1. The expected value is 0, and an index value of +1 indicates that the two clusterings are identical. A value below 0 indicates that the differences are larger than expected. To summarise all the clustering results together, a matrix of the adjusted Rand Index values for all pairs of clusterings was used in PCoA with Euclidean distance as the dissimilarity measure to compare the performance of each clustering method relative to each other and to the central Canadian vegetation–landform types (Fig. 1 and Table A1, Appendix A). All analyses except TWINSPAN and TWINDEND and spherical K-means clustering were done in R using the cluster package (Maechler et al., 2013) and the flexclust package (Leisch, 2006).

0.2 0.0

coord

CA

−0.2

PCoA 2 (5.0%)

sq−NMDS bc−PCoA sq−PCoA RDA−veg PCA bc−NMDS

CCA−veg DCA

−0.4

2.2.2. Supervised learning methods Constrained or canonical ordination methods such as redundancy analysis (RDA) and canonical correspondence analysis (CCA) were applied using 11 vegetation–landform types (Table A1, Appendix 1) as binary explanatory or external variables to assess how well the underlying vegetation–landform types and vegetation gradient are reflected by the modern pollen assemblages. The RDA and CCA results were compared with the results of the unconstrained unsupervised ordination methods by Procrustes and Protest analyses and plotted in the same PCoA space. The constrained versions of principal curves (De'ath, 1999) were also plotted to see if the smoothed curves display patterns similar to the unconstrained principal curves. The same procedure as described under unsupervised principal curves was used but with the RDA or CCA axis 1 as the starting point for the PC. Multivariate classification trees (MCTs) were also used to predict how modern pollen data are able to define the underlying vegetation– landform types (De'ath and Fabricius, 2000; Borcard et al., 2011; Simpson and Birks, 2012; James et al., 2014). MCT is a form of constrained clustering where the tree is built by a recursive algorithm. The partitioning of the data is determined by the minimum withingroup sum-of-squares, and the important variables are defined by the explanatory or predictor variables provided (De'ath and Fabricius, 2000; Borcard et al., 2011). MCT was performed using square-root transformed pollen percentage data, and the sum-of-squares was calculated from Euclidean distances. The different variables for the vegetation–landform types divided into 11, 8, 5, and 3 types (Table A1, Appendix 1) were used one by one, and the best tree was picked by assessing the relative error and the cross-validated relative error (CVRE) (Borcard et al., 2011). The R package mvpart (Therneau et al., 2013) was used to compute the MCTs. Other tree-based methods that can be used to investigate modern pollen–vegetation relationships include bagging, random forests (RFs), and boosted trees (Hastie et al., 2011a; Simpson and Birks, 2012; Lantz, 2013; James et al., 2014). Bagging and random forests were developed by Breiman (1996, 2001), and they are types of methods that are based on large data-sets and bootstrap re-sampling to produce many random trees (Simpson and Birks, 2012). Two-thirds of the data form a training-set while the remaining third is left out of the analyses (i.e. as out-of-bag (OOB) test samples). A tree model is fitted to the training-set data and the prediction errors are calculated on the basis of the OOB samples. The procedure is repeated many times to produce a set of n trees. For classification trees based on RFs, the majority vote rule is used and the class with the largest number of votes is selected as the fitted class for that observation. RFs were chosen because they are considered superior to bagging as they attempt to reduce the effect of inter-correlation of the predictor variables by splitting every tree from a random subset of the available predictors (Hastie et al., 2011a; Lantz, 2013). The number of explanatory variables chosen at random for each node was set to 4, and the number of trees (n) was set to 4000. Also here, the different variables for the vegetation–landform types divided into 11, 8, 5, and 3 types were used one by one. RFs can also be used in an unsupervised mode, where the outcome is a proximity matrix that can be used in regular clustering techniques. This matrix was used for PAM and hierarchical clustering using Ward's sum-ofsquares method to investigate if the unsupervised RF proximity matrix performed better than using traditional distance or similarity coefficients. These results were handled together with the results for the unsupervised cluster analyses, and plotted in the same PCoA space. The RF results were summarised in a confusion matrix which provides an easy comparison of the errors or misassignments for the different vegetation–landform types. Another useful output from RFs is the variable importance measures (James et al., 2014) that can be helpful in detecting which taxa discriminate between the different vegetation–landform types. The results are displayed as a confusion matrix, and the analyses were done in R using the randomForest package (Liaw and Wiener, 2002).

27

0.4

V.A. Felde et al. / Review of Palaeobotany and Palynology 210 (2014) 22–36

−1.0

−0.5

0.0

0.5

1.0

1.5

2.0

PCoA 1 (95.0%) Fig. 2. Comparison of the results of different ordination methods using principal coordinate analysis (PCoA) of the Procrustes sum-of-squares between pairs of ordinations. The names indicate which method is used. bc-NMDS and bc-PCoA are abbreviations for nonmetric multidimensional scaling (NMDS) and PCoA using the Bray–Curtis (bc) coefficient, while sq indicates that squared distances are used. Coord is short for the geographical coordinates. The percentage pollen data were square-root transformed prior to principal component analysis (PCA), correspondence analysis (CA), detrended correspondence analysis (DCA), redundancy analysis (RDA), and canonical correspondence analysis (CCA). RDA-veg and CCA-veg are abbreviations for constrained ordination using 11 vegetation–landform types as binary constraints or external predictors. The amount of variation captured by the two PCoA axes is shown in brackets on the axis labels. The distances between the points represent the total Procrustes sum-of-squares in relation to the various methods. The points closest to each other indicate that their ordination results are similar.

3. Results 3.1. Comparison of unsupervised and supervised ordination methods The results from the comparison between the ordination methods are summarised in a principal coordinate analysis of the Procrustes sum-of-squares values (Fig. 2). In general, the different ordination methods show similar results. The Protest randomisation test has correlation values between 0.8 and 0.9 which are highly significant statistically (p = 0.001). The variation explained by the first and second PCoA axes is 95.0% and 5.0%, respectively. The first PCoA axis separates all the ordination results from the sampled lake geographical coordinates, suggesting that the pollen-based ordinations are not a direct reflection of the lake geographical localities. DCA, CA, and CCA produce results that differ more from the other methods as shown by their position on the second PCoA axis (Fig. 2), while the points representing Bray–Curtis distance-based PCoA, Bray–Curtis distance-based NMDS, Euclidean distance-based PCoA, Euclidean distance-based NMDS, PCA, and RDA are relatively close to each other. The largest difference between the points seems to be affected by the distance coefficients used (Prentice, 1980), as the second PCoA axis separates those methods that are based on Euclidean and Bray–Curtis distances against the chisquare metric implicit in correspondence analysis and its close relatives of DCA and CCA. To assess how well the unconstrained ordination methods capture the underlying vegetation gradients, the RDA and CCA results using the vegetation types divided into 11 vegetation–landform types were compared with the unconstrained ordinations (ter Braak, 1986) and the results mainly confirm that the response variables (pollen types) detect the major vegetation–landform types (predictor variables) in the data with a high level of confidence. The Protest correlations between the distributions of objects based on the pollen assemblages

28

V.A. Felde et al. / Review of Palaeobotany and Palynology 210 (2014) 22–36

Unsupervised

Supervised

131 121 111 101 91

Site number

81 71 61 51 41 31 21 11 1

Curves: PCA PC_PCA 0.0

0.4

Curves: CA PC_CA 0.8

Proportional distance

0.0

0.4

Curves: NMDS PC_NMDS 0.8

Proportional distance

0.0

0.4

0.8

Proportional distance

0.0

0.4

Curves: CCA PC_CCA

Curves: RDA PC_RDA

Curves: bcPCoA PC_bcPCoA 0.8

Proportional distance

0.0

0.4

0.8

Proportional distance

0.0

0.4

0.8

Proportional distance

Fig. 3. Comparison of PCA, CA, NMDS, and PCoA ordination axes and their associated principal curves in an unsupervised mode and of RDA and CCA ordination axes and their associated principal curves in a supervised mode. The proportional distances are the distances along the principal curves normalised to the range (0,1) to allow direct comparison between methods (Simpson and Birks, 2012). Abbreviations follow Fig. 2. Site number is sample number as in Birks et al. (1975).

and the distribution of the objects in geographical space for the different methods vary between 0.464 and 0.591. The highest correlations are for CA, CCA, and DCA (r = 0.581, 0.591, 0.591), all with statistically significant p-values (p = 0.001). Consistent results among the ordinations along the first axis can also be observed in the comparison between the different principal curves (PC) and the first ordination axes. Fig. 3 shows the comparison of the first axes of unsupervised PCA, CA, Bray–Curtis distance-based PCoA, and Euclidean distance-based NMDS ordinations, and the principal curves using these axes as starting points. Fig. 3 also shows the PCs based on RDA and CCA. These PCs indicate that the main palynological structure is captured by the first gradient in all these ordination methods. This is not unexpected because the proportion of variation captured by the first ordination axis summarises almost more than half of the variation in the data captured by the second axis. However, the choice of initial ordination method has little impact on the final principal curves. The smoothed principal curves consistently capture a higher proportion of the total variation compared to the conventional ordination axes: PCA: 41.6% (PCA) vs. 76.13% (in the PCA-based PC); Bray–Curtis distance-based PCoA: 69.7% vs. 77.91%; CA: 37.8% vs. 79.32%; Euclidean distance-based NMDS: 69.5% vs. 73.26%; RDA: 55.45% vs. 74.79%; and CCA: 58.6% vs. 72.69%. CA is the method that captures the highest proportion of the total variation in the PCs. The smoothed curves are also affected by the taxon response curves, and overly complex response curves may indicate that the smoothed principal curve is over-fitted (De'ath, 1999). In Fig. 4 it is possible to see that some of the pollen taxa have complex response curves along the first

principal curve based on CA (e.g. Betula). However, this has an important palynological explanation since one pollen morphological type may represent more than one plant species which could result in multi modes at several places along the underlying latent gradient e.g. Betula pollen represents all the species within this genus which may grow in different vegetation–landform types in central Canada (e.g. Betula nana, B. cordifolia, B. papyrifera, B. nigra, and B. alleghaniensis). 3.2. Comparison of unsupervised and supervised clustering methods The comparisons of the unsupervised clustering methods using various divisive approaches basically show that none of the clustering methods has a clear advantage over the others in relation to each other and the underlying vegetation–landform types (Fig. 5). The distances between the points are based on the adjusted Rand Index. From these results it is difficult to define which methods perform best. In terms of detecting the ecologically meaningful vegetation–landform types all the methods perform more or less equally well, although all of the methods deviate slightly from the underlying vegetation–landform types. Reducing the scale of vegetation–landform types by using broader vegetation–landform types has surprisingly little effect on the clustering performance. However, methods that are based on the same underlying numerical principles produce more similar results (i.e. regular K-means partitioning, spherical K-means, hierarchical agglomerative methods using Ward's sum-of-squares method) than methods that are based on different data input (random forest

0 20 40 60 80 100 120

0 20 40 60 80 100 120

Alnus Chenopodiineae 0 20 40 60 80 100 120

0 20 40 60 80 100 120

0 20 40 60 80 100 120

Quercus

0.25 0.20 0.15 0.10 0.05 0.00

Sample no.

0.05

0 20 40 60 80 100 120

0.00

0.5 0.4 0.3 0.2 0.1 0.0

Sample no.

0.05 0.00 0 20 40 60 80 100 120

Corylus

0.10

0 20 40 60 80 100 120

0 20 40 60 80 100 120

0.00

0.10

0.25 0.20 0.15 0.10 0.05 0.00

0.20 0.15 0.10 0.05 0.00 0 20 40 60 80 100 120

0.05

Larix

0.10

0.15

Fraxinus

0.15

0.20 0.15 0.10 0.05 0.00

0.15

0 20 40 60 80 100 120

Betula

0.5 0.4 0.3 0.2 0.1 0.0

Lycopodium

0.1

0.6 0.5 0.4 0.3 0.2 0.1

0 20 40 60 80 100 120

0.2

0.35 0.30 0.25 0.20 0.15 0.10 0.05 0.00

0 20 40 60 80 100 120

Tubuliflorae

Ericaceae

RumexOxyria

0.3

29

0.20

0.20

Populus

0.6 0.5 0.4 0.3 0.2 0.1 0.0

0 20 40 60 80 100 120 0.4

0 20 40 60 80 100 120

Artemisia

0.5

0 20 40 60 80 100 120

Ambrosieae

0.30 0.25 0.20 0.15 0.10 0.05 0.00

Cyperaceae

0.5 0.4 0.3 0.2 0.1 0.0

0 20 40 60 80 100 120

Salix

0.30 0.25 0.20 0.15 0.10 0.05 0.00

0.6 0.5 0.4 0.3 0.2 0.1 0.0

0 20 40 60 80 100 120

Pinus

0.8 0.7 0.6 0.5 0.4 0.3

Gramineae

0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 20 40 60 80 100 120

Picea

V.A. Felde et al. / Review of Palaeobotany and Palynology 210 (2014) 22–36

Sample no.

Fig. 4. Response curves for the major pollen and spore taxa along the principal curve using CA as the initial starting point (Fig. 3). The curves are fitted smooth splines. Sample numbers (no.) are as in Birks et al. (1975) and Table A1, Appendix 1. The y-axis is the pollen proportions in the data-set. Note that the y-axis scales differ from graph to graph.

proximity matrix) or different approaches (TWINSPAN). The same results are observed when using numerical stopping criteria to define the number of groups in the pollen data (subplot in Fig. 5). TWINDEND defined 8 groups for TWINSPAN, and the AIC defined 8 groups for SKM. The gap statistic defined 10 groups for KM and PAM using the square-root transformed pollen data, while 9 and 10 groups were determined for PAM and hierarchical clustering using Ward's method based on the random forest proximity matrix, respectively. The supervised clustering methods provide information about how robust the results from the unsupervised cluster methods are and help to identify what vegetation–landform types are difficult to separate. The prediction accuracy or errors provide information about how well the pollen data are able to define the underlying vegetation–landform types and which vegetation–landform types are most likely to be confused with others. Results from the multivariate classification trees (MCTs) show that the minimum sized tree with the smallest crossvalidation estimates was ten when using 11 vegetation–landform types, seven for 8 vegetation–landform types, five for 5 vegetation– landform types, and three for 3 vegetation–landform types (Figs. 6–9). Depending on the number of vegetation–landforms types, the relative

error varies between 0.317 and 0.639 (Figs. 6–9) while the prediction accuracy (CV error) varies between 0.388 and 0.675 (Figs. 6–9). This implies that the accuracy of pollen to define the different vegetation–landform types decreases when broader definitions of the underlying vegetation–landform types are used. By inspecting the bar plot at each node it is possible to see how much the individual pollen taxa contribute to each split. The length of the branch indicates the variation explained by the split. The classification trees based on 11 vegetation–landform types show that the pollen composition separates the sites within the aspen parkland, deciduous forest, and grassland (Fig. 1) into one branch while sites within the vegetation–landform types of closed-coniferous forests, forest–tundra, mixed forests, and tundra (Fig. 1) are placed in the second branch (Fig. 6). The latter group is further divided into forest–tundra and tundra and coniferous forests in the second division, and overall, the MCT shows that the vegetation–landform types (Fig. 1) that are difficult to separate from each other palynologically are 1) forest–tundra and tundra, 2) aspen parkland and grassland, 3) mixed upland and lowland forests, 4) closed-coniferous forests b and open coniferous forests, and finally, 5) closed-coniferous forests a and b (Fig. 6).

30

V.A. Felde et al. / Review of Palaeobotany and Palynology 210 (2014) 22–36

rfpam3

0.5 rfpam11 rfhclus11

0.0

twin11 twin8 twin5

pam3

twin3 skm3

fuz11 skm11 rfhclus8 skm5 km3 rfpam5 fuz3 skm8 hclus11 veg11 fuz5 km11 rfhclus5 hclus3 pam11 veg8 km5 hclus8 km8 pam5 fuz8 veg5 pam8 rfhclus3 veg3

0.4

PCoA 2 (38.3%)

PCoA 2 (25.4%)

rfpam8

pam10 veg3

0.0

hclus9 kmgr10 veg5 veg8 rfpam9 veg11 rfhclu10 skm8

−0.4

hclus5

−0.5

twinend8

−0.8

−0.8

−0.40

.0

0.4

PCoA 1 (61.7%)

−1.0

−0.5

0.0

0.5

1.0

1.5

2.0

2.5

PCoA 1 (74.6%) Fig. 5. Principal coordinate analysis (PCoA) comparison of different clustering methods using a known number of groups and the groupings based on the underlying vegetation–landform types. The PCoA shown is based on the adjusted Rand Index between pairs of clusterings. The percentage variation captured by PCoA axes 1 and 2 is shown in brackets on the axis labels. The subplot shows the comparison of vegetation–landform types when using some stopping criteria. The names are indicated first by the methods that were applied and then the number which represents how many vegetation–landform groups were used (11, 8, 5, 3) (see Table A1, Appendix A) (hclus = hierarchical clustering using Ward's method; pam = partitioning around medioids; km = K-means clustering; skm = spherical K-means clustering; fuz = fuzzy clustering; twin = TWINSPAN; rfpam = partitioning around medioids using random forests dissimilarity matrix, and rfhclus = hierarchical clustering using Ward's method using the random forests dissimilarity matrix). The different colours represent the groupings of the underlying vegetation–landform types that are comparable where red = 11 groups, green = 8 groups, turquoise = 5 groups, and purple = 3 groups. The ‘true’ vegetation–landform types are abbreviated veg11, veg8, veg5, and veg3. The first PCoA axis separates the groups by different size, while the second PCoA axis represents the variation among the various clustering methods. The spatial division of the fine and broad vegetation–landform types seems to be more important than the differences between the clustering methods as the second PCoA axis explains a relatively small portion of the variation (25.4%). (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Random forests yield very similar results to the multivariate classification trees. However, the prediction error is much lower than for the MCT (21.37–12.21%). Tables 2–5 present the RF confusion matrices based on the different groupings of the vegetation–landform types (Fig. 1, Table A1, Appendix A) into 11, 8, 5, and 3 types. The low prediction errors from the RFs indicate that this method is more robust than MCT. The prediction errors for the different vegetation–landform type groupings decrease when the number of vegetation–landform types is reduced. The errors that RFs provide for pollen assemblages from each vegetation–landform type in the confusion matrix show that assemblages from forest–tundra, grassland, and mixed forests are often confused with assemblages from open-coniferous forest, aspen parkland, and closed coniferous forests, respectively. For the vector containing the vegetation–landform types divided into 11 types the error prediction for grassland alone is most likely an effect of randomness because there are so few grassland samples (samples 109 and 110) that they could all easily have ended up in the training set or in the OOB samples. However, the results are generally reasonable because non-wooded vegetation types such as grassland will have a higher proportion of background pollen, resulting in false presences of common tree taxa in these samples. To find out which pollen types are separating the different vegetation–landform types in the RFs, their variable importance can be assessed (Fig. 10). Pollen taxa with high values are good discriminators for separating the different vegetation–landform types, although it does not necessarily mean that the taxa have a high occurrence in each vegetation–landform type. Taxa that are not present in that vegetation– landform type may, however, be a good discriminator for other vegetation–landform types.

4. Discussion 4.1. Comparison of numerical methods Our results show that the various novel numerical methods applied here provide strikingly consistent results. The strong south–north vegetation gradient in central Canada is captured by the modern pollen data independently of which numerical method is used and confirms the results obtained in previous studies using traditional numerical methods (Birks et al., 1975; Birks and Gordon, 1985). The purpose of this study was not to provide any new insights regarding modern pollen–vegetation relationships in central Canada, but to assess how useful new numerical methods are for analysing and summarising the patterns in these pollen data. The power of these new approaches for prediction and for clustering data is greater in terms of handling large data-sets and in providing more robust results based on numerous computational iterations than the traditional numerical methods used by Birks et al. (1975). In a statistical sense these novel methods can be considered to be more rigorous and more robust (Simpson and Birks, 2012). Similar results between different ordination methods have been observed by others (e.g. Birks et al., 1975; Prentice, 1980; Birks and Gordon, 1985), but by using Procrustes and Protest analyses the comparison of ordination results can be quantified and evaluated statistically. Our Procrustes results confirm that the various ordination methods produce similar results when most of the variation is captured along the first ordination axis. This is not unexpected because all scaling methods (except NMDS) aim to maximise the variation along the first axis (Legendre and Birks, 2012b; Legendre and Legendre, 2012). Even

V.A. Felde et al. / Review of Palaeobotany and Palynology 210 (2014) 22–36

Picea Pinus Betula Alnus Salix Gramineae Cyperaceae Chenopodiineae Ambrosieae Artemisia Rumex.Oxyria Lycopodium Ericaceae Tubuliflorae Larix Corylus Populus Quercus Fraxinus

ccf−a,ccf−b,ccf−c ,ft,mf1,mf2,ocf,tr

ccf−a,ccf−b,ccf−c ,mf1,mf2,ocf

ccf−a,ccf−b,ccf−c,ocf

ccf−a,ccf−c ccf−a

31

ccf−c

ccf−b

ft,tr

mf1,mf2

ccf−b,ocf

mf2

ap,df,gr

ft

ap,gr

tr

df

1.06 : n=11 0.116 : n=4

mf1 0.787 : n=10 0.759 : n=14

ocf 0.789 : n=14

0.747 : n=12

0.15 : n=8 0.502 : n=220.652 : n=22 0.512 : n=14 Error : 0.317 CV Error : 0.388 SE : 0.0298 Fig. 6. Multivariate classification tree of the pollen data using the modern vegetation–landform types grouped into 11 types. The histograms show the abundant taxa within the different groups predicted by the vegetation–landform types. The cross-validated error (prediction accuracy) is 0.388 (38.8%), which is much higher than the prediction error for random forests (20.61%). Abbreviations for the vegetation–landform types follow Fig. 1, the Data section in the text, and Table A1, Appendix A. The colour coding and order of the pollen taxa are the same in Figs. 6–9 to allow ease of comparison.

though there is a theoretical advantage in using non-linear methods such as CA or DCA (basic niche theory where species have a unimodal response along the environmental gradients: ter Braak, 1986; ter Braak and Prentice, 1988), these methods do not provide markedly different results from linear-based approaches such as PCA or PCoA (Fig. 2). However, there are some differences between the ordination results (Fig. 2) that seem to be most affected by the type of distance coefficient used. Our results show that the CA and DCA ordinations differ most from the other methods. The main difference with these methods is that they preserve the chi-square metric between objects instead of Euclidean or related distances (Legendre and Birks, 2012b). It is also surprising how well NMDS does since it is only using rank distances between objects. Comparisons between DCA and NMDS suggest that NMDS is to be preferred when there are complex gradients and nonregular sampling schemes (Minchin, 1987). This is because DCA cannot often recover the second, third, etc. dimensions with success and the values resulting from the detrending of the second axis may be unreliable and uninterpretable (Legendre and Legendre, 2012). However, DCA and CA are the methods that have the highest Procrustes correlations to the geographical distribution of the sites which may indicate that DCA is not necessarily a bad choice. There are very small differences in the RDA and CCA results when the vegetation–landform types are used as external constraints in the supervised mode compared to the PCA and CA results in the unsupervised mode. This implies (ter Braak, 1986) that the vegetation–landform types are the underlying gradients determining the basic structure in the pollen data, and that the pollen data are able to define the surrounding vegetation–landform types

with a relatively high accuracy. Our results show that there are links between the ordination results and the geographical distributions of the sites for all the methods, but the correlation is not very high. Therefore the ordination results presented in two-dimensional space only partly reflect the underlying geographical pattern of the objects. Principal curves are considered to have a great potential in the analysis of palaeoecological data because they can summarise changes in multiple dimensions along a single gradient (Simpson and Birks, 2012). All the curves display the underlying pollen compositional gradient and thus the underlying vegetational gradient in central Canada. However, it is also observed that different ordination methods used as starting configurations produce slightly different results and the principal curves are not strongly superior in summarising the variation compared to the variation captured by the first ordination axes. The most likely explanation is that it is not possible for principal curves to incorporate much additional information as most of the variation is already well captured by the first ordination axis. Smoothing the points along the specific gradient of interest may extract the most important information about the composition and turnover in pollen assemblages over space. From Fig. 3 it seems that the principal curves show less variation when the vegetation is similar and show larger fluctuations when there is a high turnover among sites. Principal curves may therefore be valuable in detecting major changes in species composition across ecotones (e.g. Heegaard et al., 2006; Antoniades et al., 2014; Birks, 2014). Principal curves also have the potential for identifying rapid or significant temporal shifts in fossil assemblages. An example of using principal curves to identify temporal shifts is given by Simpson and Birks (2012)

32

V.A. Felde et al. / Review of Palaeobotany and Palynology 210 (2014) 22–36

Picea Pinus Betula Alnus Salix Gramineae Cyperaceae Chenopodiineae Ambrosieae Artemisia Rumex.Oxyria Lycopodium Ericaceae Tubuliflorae Larix Corylus Populus Quercus Fraxinus ccf,ocf ccf

ccf,ft,mf,ocf,tr

ccf,mf,ocf

ap,df,gr

ft,tr

mf

ft

ap,gr

tr

1.06 : n=11

df

0.116 : n=4

ocf 0.787 : n=10 0.759 : n=14 1.94 : n=26

2.48 : n=52 0.512 : n=14 Error : 0.4 CV Error : 0.448 SE : 0.0308 Fig. 7. Multivariate classification tree of the pollen data using the modern vegetation–landform types in 8 groups. The histograms show the abundant taxa within the different groups predicted by the vegetation–landform types. The cross-validated error (prediction accuracy) has increased from 0.388 (11 vegetation–landform types) to 0.448 (8 types), and a smaller proportion of the variation in the data is captured. Abbreviations for the vegetation–landform types follow Fig. 1, the Data section in the text, and Table A1, Appendix A.

who summarised rapid changes in pollen composition at Abernethy Forest in the early Holocene and late glacial. Cluster analysis and ordination techniques have similar aims in that they attempt to explore multivariate data-sets by reducing their dimensionality and summarising the major patterns of variation within the data-sets (Birks and Gordon, 1985; Shi, 1993). The various clustering methods are useful complements to ordination methods as they assign the objects into groups a posteriori (Birks et al., 1975). However, all the new non-agglomerative clustering methods used in this paper need to know a priori the number of groups as an input to the algorithm before running the analyses. If the data are well structured this is not a great problem since it can be relatively easy to define the numbers of potential groups using a robust stopping rule. Several statistical criteria have been developed to assess the structure of the data and to determine the numbers of groups. Milligan and Cooper (1985) tested 30 different criteria for finding the most reliable stopping rule, and they concluded that the Calinski–Harabasz criterion (Legendre and Birks, 2012a) is one of the best. However, they also recognised that these results are based on well-structured data and that the results might vary if the structure in a particular data-set is different. In our case, the Calinski– Harabasz criterion failed to suggest more than two major groups in the central Canadian pollen data-set, simply separating the objects in the south from the rest of the data. Therefore we chose to use different numerical criteria to determine the ‘correct’ number of groups. The AIC values for SKM and the TWINDEND results for TWINSPAN suggest 8 groups, an equal number to one of the groupings of the underlying vegetation–landform types (Table A1, Appendix 1). However, none of these groups were generally more similar to the underlying vegetation–landform types compared to the groupings obtained by other methods with 9 or 10 clusters. The groups that are defined by SPHERIKM are also

influenced by the weighting used and slightly different results can be expected if weights of 0.5 are used instead of 1 (Hill et al., 2013; Preston et al., 2013). Preston et al. (2013) chose an arbitrary cluster number after investigating the data for floristic elements within the British and Irish flora and finding that the statistically optimum number determined by AIC was not particularly useful. They suggest that deciding on the number of clusters should be up to the user and based on the purpose of the study because there is no single ‘correct’ classification. The results in our study support this. However, the numbers of clusters determined by statistical criteria (8–10) were not far from the ‘real’ number of groups (8 and 11). The ecological interpretations based on these different groupings do not differ much from each other. However, using a statistical criterion ensures that the clusterings are reproducible. It is not surprising that it is not possible to identify the optimal number of clusters in this study because according to the random forest results, vegetation–landform types can be misidentified in 21% of the cases. A problem with palynological data is that the data may be collected along a gradient with gradual pollen compositional changes from one group to the other (Birks and Gordon, 1985). The differences in clusterings between methods can thus be a result of relatively arbitrary cuts in the transition zones between different vegetation–landform types. In these cases, when choosing a method, it is preferable to pick one that includes the membership coefficients of the objects for the various vegetation–landform types such as fuzzy clustering. What is most surprising is that the simple Ward's hierarchical clustering using the sum-of-squares principle (Birks et al., 1975; Birks and Gordon, 1985) performs as well as or almost slightly better than the divisive procedures. The latter were developed for data-sets that are much larger than the data-set used here. However, our results show that these methods are also useful in the analysis of smaller data-sets. Spherical

V.A. Felde et al. / Review of Palaeobotany and Palynology 210 (2014) 22–36

Picea Pinus Betula Alnus Salix Gramineae Cyperaceae Chenopodiineae Ambrosieae Artemisia Rumex.Oxyria Lycopodium Ericaceae Tubuliflorae Larix Corylus Populus Quercus Fraxinus cf

cf,ft,mf,tr

cf,mf

dfpg

ft,tr

1.76 : n=15 ft

mf

0.787 : n=10 3.48 : n=66

33

tr

0.759 : n=14

1.94 : n=26 Error : 0.456 CV Error : 0.498 SE : 0.0318

Fig. 8. Multivariate classification tree of the pollen data using the modern vegetation–landform types in 5 groups. The histograms show the abundant taxa within the different groups predicted by the vegetation–landform types. The cross-validated error (prediction accuracy) has increased from 0.388 (11 vegetation–landform types) to 0.498 (5 types), and a smaller proportion of the variation in the data is captured. Abbreviations for the vegetation–landform types follow Fig. 1, the Data section in the text, and Table A1, Appendix A.

K-means clustering based on taxa did not show any superior performance compared to regular K-means clustering. The results based on the proximity matrix provided by unsupervised random forests showed no major differences from the other methods. When the true numbers of clusters are known, supervised cluster methods are useful for determining how well these can be separated palynologically. The results from supervised clusterings basically show that some vegetation–landform types are most likely to be clustered in the wrong groups when using unsupervised clustering techniques. Possible reasons for these difficulties include differential pollen production and dispersal of different taxa. Open vegetation–landform types may allow a larger background component of long-distance extraregional (sensu Janssen, 1973, 1981) pollen dispersers such as Pinus and Picea and it is thus difficult to distinguish coniferous forests from mixed coniferous–deciduous vegetation types. What is most surprising are the results showing that the accuracy of distinguishing vegetation– landform types decreases when broader vegetation–landform groupings are used (Figs. 6–9, Tables 2–5). It was expected that using more generalised vegetation–landform types would result in the pollen data having a higher accuracy in determining the various broader vegetation–landform groups. However, this is an interesting result because it indicates that pollen are able to distinguish finer vegetation–landform types with some degree of confidence (Janssen, 1981), and that less variation in the data is captured when broader definitions of the vegetation–landform types are used. Both multivariate classification trees and random forests produce useful results by showing which vegetation– landform types are difficult to separate on the basis of their pollen spectra (Figs. 6–9, Tables 2–5). These techniques can handle complex datasets that include missing values, non-linear relationships between

variables, and high-order interactions (Breiman, 1996; De'ath, 2002). Since these tree-based methods can be used for both exploratory and predictive analyses, these methods are attractive alternatives to constrained ordination techniques (De'ath and Fabricius, 2000). The advantage of a multivariate classification tree over random forests is that it is easy to use and to interpret the graphical output. However, random forests produce lower prediction errors and are considered more robust because they attempt to reduce the variation by averaging many trees instead of choosing the best fitted tree (Simpson and Birks, 2012). Random forests are more difficult to interpret because the mechanisms behind the random forests need careful understanding and they can easily appear as a ‘black box’ (Breiman, 2001). By reducing the intercorrelation between variables, random forests may improve the accuracy of the predictions, but not necessarily their precision. Breiman (2001) and Lantz (2013) list several desirable characteristics of random forests. These include being relatively robust to outliers and noise, being computationally faster than bagging or boosting, and providing useful internal estimates of errors and information about variable importance. These characteristics are attractive when analysing large data-sets and random forests may prove to be very useful numerical tools in the future with large sub-continental or continental scale palynological data-sets. In addition, pollen samples from unknown vegetation types (e.g. fossil samples) can easily be assigned to a modern vegetation–landform type using random forests. 5. Conclusions There now exists a large number of robust numerical methods for detecting structure in multivariate palynological data. Based on our

34

V.A. Felde et al. / Review of Palaeobotany and Palynology 210 (2014) 22–36

Picea Pinus Betula Alnus Salix Gramineae Cyperaceae Chenopodiineae Ambrosieae Artemisia Rumex.Oxyria Lycopodium Ericaceae Tubuliflorae Larix Corylus Populus Quercus Fraxinus

cf,trft

cf

mfpg

trft 6.7 : n=41

3.48 : n=66

2.04 : n=24 Error : 0.639 CV Error : 0.675 SE : 0.0482

Fig. 9. Multivariate classification tree of the pollen data using the modern vegetation–landform types in 3 broad groups. The histograms show the abundant taxa within the different groups predicted by the vegetation–landform types. The cross-validated error (prediction accuracy) of the classification has increased from 0.388 (11 vegetation–landform types) to 0.675 (3 types), and a much smaller proportion of the variation in the data is captured. Abbreviations for the vegetation–landform types follow Fig. 1, the Data section in the text, and Table A1, Appendix A.

Table 2 Results from random forests when the modern vegetation–landform types are in 11 groups. The out-of-bag estimate of the prediction error is 21.37%. Vegetation–landform type abbreviations follow the Data section in the text and Table A1, Appendix A. The diagonal values in bold show perfect matches between the random forest clusters and the vegetation-landform types.

ap ccf-a ccf-b ccf-c df ft gr mf1 mf2 ocf tr

ap

ccf-a

ccf-b

ccf-c

df

ft

gr

mf1

mf2

ocf

tr

Class. error

8 0 0 0 0 0 2 0 0 0 0

0 8 0 0 0 0 0 0 1 0 0

0 0 18 0 0 1 0 1 0 1 0

0 0 1 22 0 0 0 2 1 1 0

0 0 0 0 4 0 0 0 0 0 0

0 0 1 0 0 1 0 0 0 3 0

0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 8 1 0 0

1 0 0 0 0 0 0 1 11 0 0

0 0 2 0 0 6 0 0 0 9 0

0 0 0 0 0 2 0 0 0 0 14

0.111 0.000 0.182 0.000 0.000 0.900 1.000 0.333 0.214 0.357 0.000

Table 3 Results from random forests when the modern vegetation–landform types are in 8 groups. The out-of-bag estimate of the prediction error is 20.61%. Vegetation–landform type abbreviations follow the Data section in the text and Table A1, Appendix A. The diagonal values in bold show perfect matches between the random forest clusters and the vegetation-landform types.

ap ccf df ft gr mf ocf tr

ap

ccf

df

ft

gr

mf

ocf

tr

Class. error

8 0 0 0 2 0 0 0

0 50 0 1 0 7 2 0

0 0 4 0 0 0 0 0

0 1 0 1 0 0 3 1

0 0 0 0 0 0 0 0

1 0 0 0 0 19 0 0

0 1 0 6 0 0 9 0

0 0 0 2 0 0 0 13

0.111 0.038 0.000 0.900 1.000 0.269 0.357 0.071

Table 4 Results from random forests when the modern vegetation–landform types are in 5 groups. The out-of-bag estimate of the prediction error is 14.5%. Vegetation–landform type abbreviations follow the Data section in the text and Table A1, Appendix A. The diagonal values in bold show perfect matches between the random forest clusters and the vegetation-landform types.

cf dfapgr ft mf tr

cf

dfapgr

ft

mf

tr

Class. error

63 0 6 7 0

0 14 0 0 0

3 0 2 0 0

0 1 0 19 0

0 0 2 0 14

0.045 0.067 0.800 0.269 0.000

Table 5 Results from random forests when the modern vegetation–landform types are in 3 broad types. The out-of-bag estimate of the prediction error is 12.21%. Vegetation–landform type abbreviations follow the Data section in the text and Table A1, Appendix A. The diagonal values in bold show perfect matches between the random forest clusters and the vegetation-landform types.

cf mfapgr trft

cf

mfapgr

trft

Class. error

62 7 5

0 34 0

4 0 19

0.061 0.171 0.208

results it is not possible to recommend one method over the others. The traditional methods used in Birks et al. (1975) and Birks and Gordon (1985) are useful in the sense of providing ecologically meaningful results that are relatively easy to interpret. However, the new unsupervised clustering methods, together with the new supervised

V.A. Felde et al. / Review of Palaeobotany and Palynology 210 (2014) 22–36

0.05

Fraxinus Populus Corylus Rumex.Oxyria Salix Lycopodium Larix Pinus Ericaceae Cyperaceae Quercus Tubuliflorae Alnus Ambrosieae Betula Chenopodiineae Picea Artemisia Gramineae

Gramineae Lycopodium Tubuliflorae Fraxinus Larix Corylus Chenopodiineae Artemisia Ambrosieae Alnus Populus Rumex.Oxyria Ericaceae Salix Betula Quercus Picea Pinus Cyperaceae

0.00

ccf−c

df 0.05

Cyperaceae Rumex.Oxyria Lycopodium Larix Populus Ericaceae Salix Chenopodiineae Fraxinus Tubuliflorae Ambrosieae Alnus Quercus Betula Gramineae Corylus Artemisia Picea Pinus

Corylus Populus Lycopodium Larix Alnus Artemisia Betula Chenopodiineae Tubuliflorae Fraxinus Salix Rumex.Oxyria Quercus Ambrosieae Ericaceae Gramineae Pinus Picea Cyperaceae

0.00

mf1

0.00

0.10 0.08 0.06 0.04 0.02 0.00

ocf

0.20 0.15 0.10 0.05 0.00

mf2

tr 0.20 0.15 0.10 0.05 0.00

Corylus Rumex.Oxyria Larix Fraxinus Ambrosieae Quercus Populus Pinus Picea Salix Lycopodium Tubuliflorae Artemisia Betula Chenopodiineae Alnus Cyperaceae Gramineae Ericaceae

importance

Larix Lycopodium Cyperaceae Salix Quercus Ericaceae Corylus Tubuliflorae Gramineae Fraxinus Populus Alnus Pinus Rumex.Oxyria Ambrosieae Picea Betula Chenopodiineae Artemisia

−0.01

Populus Tubuliflorae Rumex.Oxyria Larix Fraxinus Salix Chenopodiineae Lycopodium Quercus Pinus Alnus Ericaceae Ambrosieae Picea Betula Corylus Gramineae Cyperaceae Artemisia

0.08 0.06 0.04 0.02 0.00

Alnus Ambrosieae Fraxinus Corylus L ycopodium Larix Chenopodiineae Rumex.Oxyria Artemisia Tubuliflorae Salix Quercus Gramineae Pinus Populus Picea Ericaceae Cyperaceae Betula

importance

gr 0.01

ft 0.06 0.04 0.02 0.00

0.10

0.02

Fraxinus Tubu liflorae Quercus Salix Corylus Ambrosieae P opulus Rumex.Oxyria Chenopodiineae Larix Alnus Lycopodium Artemisia Ericaceae Betula Gramineae Pinus Cyperaceae Picea

0.10

ccf−b 0.20 0.15 0.10 0.05 0.00

Lycopodium Tubuliflorae Ambrosieae Larix Fraxinus Rumex.Oxyria Populus Betula Artemisia Corylus Salix Chenopodiineae Quercus Gramineae Alnus Cyperaceae Ericaceae Picea Pinus

0.25 0.20 0.15 0.10 0.05 0.00

ccf−a 0.25 0.20 0.15 0.10 0.05 0.00

Larix Rumex.Oxyria Fraxinus Lycopodium Quercus Ambrosieae Corylus Alnus Betula Tubuliflorae Artemisia Populus Pinus Ericaceae Picea Chenopodiineae Salix Cyperaceae Gramineae

importance importance

ap 0.15

35

Fig. 10. Importance variables from random forests using the eleven vegetation–landform types. Each bar plot represents a different vegetation–landform type and the estimated importance of each variable. It is important to note that high numerical importance does not necessarily mean high numerical abundance. It reflects the relative importance of a particular pollen type in discriminating a specific vegetation–landform type. Abbreviations for the vegetation–landform types follow Fig. 1, the Data section in the text, and Table A1, Appendix A.

techniques, are, in contrast to the classical methods, more robust in a statistical sense when dealing with complex multivariate data. Several provide a basis for assessing the statistical significance of correlations between results, for assessing the number of clusters in a data-set, and for evaluating the predictive power of a data-set. Based on the results presented here and on analysis of other larger and more complex pollen data-sets (VA Felde and HJB Birks unpublished), we suggest that principal curves, multivariate classification trees, Procrustes and Protest analyses, fuzzy C-means clustering, K-means partitioning, partitioning around medoids, and random forests are valuable additions to the ever-expanding numerical tool-kit (Birks, 2013) for palynologists who want to explore and summarise

patterns and generate hypotheses relating to their hard-earned pollen-analytical data-sets. Supplementary data to this article can be found online at http://dx. doi.org/10.1016/j.revpalbo.2014.06.003. Acknowledgements We are grateful to Glenn De'ath and Gavin Simpson for advice about principal curves and to Mark Hill for providing his SPHERIKM program and advice about spherical K-means clustering. We also thank Cathy Jenks for help in editing the manuscript, and Beate Helle for help with the figures. We acknowledge helpful comments from two anonymous

36

V.A. Felde et al. / Review of Palaeobotany and Palynology 210 (2014) 22–36

reviewers. This work has been supported by Miljø 2015 LAND — Terrestrial biodiversity through time — novel methods and their applications [203804/E40], funded by NFR and from the Olaf Grolle Olsen's Legacy to the University of Bergen with the addition of the bequest of Miranda Bødtker. References Andersen, S.Th, 1970. The relative pollen productivity and pollen representation of north European trees, and correction factors for tree pollen spectra. Danmarks Geologiske Undersøgelse Series II, 96, 1–99. Andersen, S.Th, 1978. Local and regional vegetational development in eastern Denmark in the Holocene. Geological Survey of Denmark. Yearbook 1976, Copenhagen, pp. 5–27. Antoniades, D., Douglas, M.S.V., Michelutti, N., Smol, J.P., 2014. Determining diatom ecotones and their relationship to terrestrial ecoregion designation in the central Canadian Arctic islands. J. Phycol. http://dx.doi.org/10.1111/jpy.12195. Austin, M.P., 2013. Inconsistencies between theory and methodology: a recurrent problem in ordination studies. J. Veg. Sci. 24, 251–268. Birks, H.J.B., 2012. Introduction and overview of part II. In: Birks, H.J.B., Lotter, A.F., Juggins, S., Smol, J.P. (Eds.), Tracking environmental change using lake sediments. vol 5: Data handling and numerical techniques. Springer, Dordrecht, pp. 101–121. Birks, H.J.B., 2013. Numerical analysis methods, In: Elias, S.A., Mock, C.M. (Eds.), 2nd edition. Encyclopedia of Quaternary Science, vol. 3. Elsevier, Amsterdam, pp. 821–830. Birks, H.J.B., 2014. Aquatic ecotones — new insights from Arctic Canada. J. Phycol. http:// dx.doi.org/10.1111/jpy.12206. Birks, H.J.B., Birks, H.H., 1980. Quaternary Palaeoecology. Edward Arnold, London. Birks, H.J.B., Gordon, A.D., 1985. Numerical Methods in Quaternary Pollen Analysis. Academic Press, London. Birks, H.J.B., Webb, T., Berti, A.A., 1975. Numerical analysis of pollen samples from central Canada: a comparison of methods. Rev. Palaeobot. Palynol. 20, 133–169. Birks, H.J.B., Lotter, A.F., Juggins, S., Smol, J.P. (Eds.), 2012. Tracking Environmental Change Using Lake Sediments, Volume 5: Data Handling and Numerical Techniques. Springer, Dordrecht. Borcard, D., Gillet, F., Legendre, P., 2011. Numerical Ecology with R. Springer, New York. Breiman, L., 1996. Bagging predictors. Mach. Learn. 24, 123–140. Breiman, L., 2001. Random forests. Mach. Learn. 45, 5–32. Bryson, R.A., 1966. Air masses, streamlines and the boreal forest. Geogr. Bull. (Can.) 8, 228–269. Crawley, M.J., 2013. The R Book, 2nd edition. Wiley, Chichester. Davis, M.B., 1963. On the theory of pollen analysis. Am. J. Sci. 261, 897–912. Davis, M.B., 2000. Palynology after Y2K — understanding the source area of pollen in sediments. Annu. Rev. Earth Planet. Sci. 28, 1–18. De'ath, G., 2002. Multivariate regression trees: a new technique for modeling species—environment relationships. Ecology 83, 1108–1117. De'ath, G., 1999. Principal curves: a new technique for indirect and direct gradient analysis. Ecology 80, 2237–2253. De'ath, G., Fabricius, K.E., 2000. Classification and regression trees: a powerful yet simple technique for ecological data analysis. Ecology 81, 3178–3192. Digby, P.G.N., Kempton, R.A., 1987. Multivariate Analysis of Ecological Communities. Chapman and Hall, London. Felde, V.A., Peglar, S.M., Bjune, A.E., Grytnes, J.-A., Birks, H.J.B., 2014. The relationship between vegetation composition, vegetation zones and modern pollen assemblages in Setesdal, southern Norway. The Holocene. http://dx.doi.org/10.1177/0959683614534745. Fielding, A., 2007. Cluster and Classification Techniques for the Biosciences. Cambridge University Press, Cambridge. Gordon, A.D., 1999. Classification, 2nd edition. Chapman & Hall/CRC, Boca Raton. Gower, J.C., 1975. Generalized Procrustes analysis. Psychometrika 40, 33–51. Hastie, T., Tibshirani, R., Friedman, J., 2011a. The Elements of Statistical Learning, 2nd edition. Springer, New York. Hastie, T., De'ath, G., Walsh, C., 2011b. pcurve: Principal Curve Analysis. http://CRAN.Rproject.org/package=pcurve. Heegaard, E., Lotter, A.F., Birks, H.J.B., 2006. Aquatic biota and the detection of climate change: are there consistent aquatic ecotones? J. Paleolimnol. 35, 507–518. Herzschuh, U., Birks, H.J.B., 2010. Evaluating the indicator value of Tibetan pollen taxa for modern vegetation and climate. Rev. Palaeobot. Palynol. 160, 197–208. Hill, M.O., 2012. Program SPHERIKM for Spherical k-means Clustering. Centre for Ecology and Hydrology, Wallingford. Hill, M.O., Šmilauer, P., 2005. TWINSPAN for Windows Version 2.3. Centre for Ecology and Hydrology, Wallingford. Hill, M.O., Harrower, C.A., Preston, C.D., 2013. Spherical k-means clustering is good for interpreting multivariate species occurrence data. Methods Ecol. Evol. 4, 542–5551. Hubert, L., Arabie, P., 1985. Comparing partitions. J. Classif. 2, 193–218. Jackson, D.A., 1995. PROTEST: a Procrustean randomization test of community environment concordance. Ecoscience 2, 297–303. James, G., Witten, D., Hastie, T., Tibshirani, R., 2014. An Introduction to Statistical Modelling. Springer, New York.

Janssen, C.R., 1973. Local and regional pollen deposition. In: Birks, H.J.B., West, R.G. (Eds.), Quaternary Plant Ecology. Blackwell Scientific Publications, Oxford, pp. 31–42. Janssen, C.R., 1981. On the reconstruction of past vegetation by pollen analysis: a review. Proceedings of the IVth International Palynological Conference, Lucknow (1976– 1977), 3, pp. 3, 163–3, 172. Juggins, S., 2013. rioja: Analysis of Quaternary Science Data. Version 0.8-5. http://cran.rproject.org/web/packages/rioja/index.html. Lantz, B., 2013. Machine Learning with R. Packt Publishing, Birmingham. Legendre, P., Birks, H.J.B., 2012a. Clustering and partitioning. In: Birks, H.J.B., Lotter, A.F., Juggins, S., Smol, J.P. (Eds.), Tracking environmental change using lake sediments. vol 5: Data handling and numerical techniques. Springer, Dordrecht, pp. 167–200. Legendre, P., Birks, H.J.B., 2012b. From classical to canonical ordination. In: Birks, H.J.B., Lotter, A.F., Juggins, S., Smol, J.P. (Eds.), Tracking environmental change using lake sediments. vol 5: Data handling and numerical techniques. Springer, Dordrecht, pp. 201–248. Legendre, P., Legendre, L., 2012. Numerical Ecology, 3rd English edition. Elsevier, Amsterdam. Leisch, F., 2006. A toolbox for K-centroids cluster analysis. Comput. Stat. Data Anal. 51, 526–544. Liaw, A., Wiener, M., 2002. Classification and regression by randomForest. R News 2, 18–22. Lichti-Federovich, S., Ritchie, J.C., 1968. Recent pollen assemblages from the western interior of Canada. Rev. Palaeobot. Palynol. 7, 297–344. Maechler, M., Rousseeuw, P., Struyf, A., Hubert, M., Hornik, K., 2013. cluster: Cluster Analysis Basics and Extensions. Version 1.14.4. http://cran.r-project.org/web/packages/ cluster/index.html. Milligan, G.W., Cooper, M.C., 1985. An examination of procedures for determining the number of clusters in a data set. Psychometrika 50, 159–179. Minchin, P.R., 1987. An evaluation of the relative robustness of techniques for ecological ordination. Vegetatio 69, 89–107. Mirkin, B., 2011. Choosing the number of clusters. Wiley Interdiscip. Rev. Data Min. Knowl. Disc. 1, 252–260. Oksanen, J., Blanchet, G.F., Kindt, R., Legendre, P., Minchin, P.R., O'Hara, R.B., Simpson, G.L., Solymos, P., Stevens, H.H., Wagner, H., 2013. vegan: Community Ecology Package. Version 2.0-10. http://cran.r-project.org/web/packages/vegan/index.html. Pelánková, B., Kuneš, P., Chytrý, M., Jankovská, V., Ermakov, N., Svobodová-Svitavská, H., 2008. The relationship of modern pollen spectra to vegetation and climate along a steppe-forest–tundra transition in southern Siberia, explored by decision trees. The Holocene 18, 1259–1271. Peres-Neto, P., Jackson, D., 2001. How well do multivariate data sets match? The advantages of a Procrustean superimposition approach over the Mantel test. Oecologia 129, 169–178. Prentice, I.C., 1980. Multidimensional scaling as a research tool in Quaternary palynology: a review of theory and methods. Rev. Palaeobot. Palynol. 31, 71–104. Preston, C.D., Hill, M.O., Harrower, C.A., Dines, T.D., 2013. Biogeographical patterns in the British and Irish flora. New J. Bot. 3, 96–116. R Core Team, 2013. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna (http://www.r-project.org/). Ritchie, J.C., 1962. A geobotanical survey of northern Manitoba. Arctic Inst. North Am. Tech. Bull. 9, 1–47. Shi, G.R., 1993. Multivariate data analysis in palaeoecology and palaeobiogeography—a review. Palaeogeogr. Palaeoclimatol. Palaeoecol. 105, 199–234. Simpson, G.L., Birks, H.J.B., 2012. Statistical learning in palaeolimnology. In: Birks, H.J.B., Lotter, A.F., Juggins, S., Smol, J.P. (Eds.), Tracking environmental change using lake sediments. vol 5: Data handling and numerical techniques. Springer, Dordrecht, pp. 249–327. Šmilauer, P., Lepš, J., 2014. Multivariate Analysis of Ecological Data Using Canoco 5, 2nd edition. Cambridge University Press, Cambridge. Sugita, S., 1994. Pollen representation of vegetation in Quaternary sediments: theory and method in patchy vegetation. J. Ecol. 82, 881–897. Sugita, S., 2007a. Theory of quantitative reconstruction of vegetation I. Pollen from large sites REVEALS regional vegetation composition. The Holocene 17, 229–241. Sugita, S., 2007b. Theory of quantitative reconstruction of vegetation II. All you need is LOVE. The Holocene 17, 243–257. ter Braak, C.J.F., 1986. Canonical correspondence analysis: a new eigenvector method for multivariate direct gradient analysis. Ecology 67, 1167–1179. ter Braak, C.J.F., Prentice, I.C., 1988. A theory of gradient analysis. Adv. Ecol. Res. 18, 271–317. ter Braak, C.J.F., Šmilauer, P., 2012. Canoco Reference Manual and User's Guide: Software for Ordination (Version 5.0). Microcomputer Power, Ithaca, New York. ter Braak, C.J.F., Verdonschot, P.F.M., 1995. Canonical correspondence analysis and related methods in aquatic ecology. Aquat. Sci. 57, 255–289. Therneau, T.M., et al., 2013. mvpart: Multivariate Partitioning. (rpart by Terry M Therneau, Beth Atkinson). R Port of rpart by Brian Ripley. Some Routines from Vegan — Jari Oksanen. Extensions and Adaptations of rpart to mvpart by Glenn De'ath, (2013)R package version 1.6-1. http://CRAN.R-project.org/package=mvpart. Tibshirani, R., Walther, G., Hastie, T., 2001. Estimating the number of clusters in a data set via the gap statistic. J. R. Stat. Soc. Ser. B (Stat Methodol.) 63, 411–423. Wehrens, R., 2011. Chemometrics with R. Springer, New York.