ISPRS Journal of Photogrammetry and Remote Sensing 129 (2017) 151–161
Exploring diversity in ensemble classification: Applications in large area land cover mapping

Andrew Mellor (a, *), Samia Boukir (b)

(a) School of Mathematical and Geospatial Sciences, RMIT University, Melbourne, VIC 3001, Australia
(b) Bordeaux INP, G&E, EA 4592, F-33600 Pessac, France

* Corresponding author. E-mail addresses: [email protected] (A. Mellor), [email protected] (S. Boukir).
Article history: Received 27 November 2016; Received in revised form 25 April 2017; Accepted 25 April 2017.
Keywords: Diversity; Ensemble; Margin; Random forests; Training data; Classification; Land cover; Remote sensing
Abstract

Ensemble classifiers, such as random forests, are now commonly applied in the field of remote sensing, and have been shown to perform better than single classifier systems, resulting in reduced generalisation error. Diversity across the members of ensemble classifiers is known to have a strong influence on classification performance, whereby classifier errors are uncorrelated and more uniformly distributed across ensemble members. The relationship between ensemble diversity and classification performance has not yet been fully explored in the fields of information science and machine learning, and has never been examined in the field of remote sensing. This study is a novel exploration of ensemble diversity and its link to classification performance, applied to a multi-class canopy cover classification problem using random forests and multisource remote sensing and ancillary GIS data, across seven million hectares of diverse dry-sclerophyll dominated public forests in Victoria, Australia. A particular emphasis is placed on analysing the relationship between ensemble diversity and ensemble margin, two key concepts in ensemble learning. The main novelty of our work lies in boosting diversity by emphasizing the contribution of lower margin instances used in the learning process. Exploring the influence of tree pruning on diversity is also a new empirical analysis that contributes to a better understanding of ensemble performance. Results reveal insights into the trade-off between ensemble classification accuracy and diversity, and, through the ensemble margin, demonstrate how inducing diversity by targeting lower margin training samples is a means of achieving better classifier performance for more difficult or rarer classes and reducing information redundancy in classification problems. Our findings inform strategies for collecting training data and designing and parameterising ensemble classifiers, such as random forests. This is particularly important in large area remote sensing applications, for which training data is costly and resource intensive to collect.

© 2017 International Society for Photogrammetry and Remote Sensing, Inc. (ISPRS). Published by Elsevier B.V. All rights reserved.
1. Introduction

Across a broad range of applications, ensemble classification systems (also known as multiple or committee classifiers) have been shown to produce better results than single expert systems (Polikar, 2006) and achieve reduced generalisation error (Opitz and Maclin, 1999; Tumer and Ghosh, 1996). In remote sensing application areas, such as ecology and natural resource management, ensemble classifiers, like Random Forests (RF) (Breiman, 2001), have become increasingly popular. Incorporating remote sensing data and ancillary continuous and categorical biophysical
spatial data, RF has been applied in a variety of large area land cover (Rodriguez-Galiano et al., 2012) and forest attribution studies, including biomass (Baccini et al., 2008), canopy height (Wilkes et al., 2015), canopy cover (Mellor et al., 2015) and species (Dalponte et al., 2013; Evans and Cushman, 2009). The RF classifier builds an ensemble of decision trees (known as base classifiers or ensemble members) and assigns classification through voting or averaging among these ensemble members. Diversity between ensemble members is considered a key factor affecting overall classification performance (Ham et al., 2005; Kapp et al., 2007; Kuncheva and Whitaker, 2003; Melville and Mooney, 2005). Ensemble classifiers which achieve higher overall classification rates are those in which misclassified instances (errors) made by ensemble members are uncorrelated (Banfield et al., 2005; Elghazel et al., 2011). Ensemble classifiers are often
more accurate than their component (base) classifiers, and diversity is greater, if errors made by ensemble members are uncorrelated (Díez-Pastor et al., 2015; Hansen and Salamon, 1990) and more uniformly distributed (Banfield et al., 2005). While ensemble diversity has been studied in the fields of information science and machine learning, to the best of our knowledge, the relationship between ensemble diversity and classification performance has not been actively explored in remote sensing. Gaining a greater insight into the role of diversity in ensemble classification is important, not least because of the increasing popularity of ensemble classifiers, such as random forests, in this field (Belgiu and Drăguț, 2016). Moreover, while advances in remote sensing science and technology (such as new sensors and image analysis techniques) seek to address land cover mapping (classification) error, the availability of suitable reference (training and test) data is a fundamental requirement in supervised image classification (Foody et al., 2016). Training and test data are also expensive (Pflugmacher et al., 2012), and as such, there are significant benefits to designing classifiers which make more efficient use of training data, such as reducing class information redundancy and maximizing the application of training data for classes which are rarer or more difficult to classify.

In this paper, we explore the relationship between ensemble diversity and classification performance in the context of large area land cover classification across complex forest ecosystems and topography, using remote sensing and ancillary spatial data. We focus on the relationship between ensemble diversity and ensemble margin, two fundamental concepts in ensemble learning. Applying the RF classifier, we evaluate different ways of inducing diversity in ensemble classification to improve classification performance and efficiency, and reduce training data redundancy. The main novelty of our work lies in boosting diversity by targeting lower margin training samples (which represent class decision boundaries or more difficult or rarer classes) in the learning process. We also propose a new empirical analysis that explores the influence of tree pruning, and decision tree depth, on diversity, which leads to a better understanding of RF classifier performance. The findings of this work may be used to inform training data collection strategies and to design more efficient classifiers. Key concepts used in the paper are introduced in Sections 2 through 4. Section 5 describes the study area and data, and experiments, results and discussion are included in Sections 6 through 8.

2. Random forests

Random forests (Breiman, 2001) is a popular ensemble classifier (Belgiu and Drăguț, 2016), which generates decision trees using sub-sets of bootstrap-aggregated training data (sampling with replacement), otherwise known as bagging. These decision trees represent diverse base classifiers, which are combined into an ensemble. In addition to bagging, diversity is induced through the random selection of a sub-set of input (explanatory or predictor) variables which are evaluated for partitioning data at each decision tree node (Elghazel et al., 2011). A response variable is predicted as a modal vote (for categorical data) or average (for continuous variables) among the ensemble decision trees.
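To ground the discussion, the following minimal R sketch (not the authors' code) fits such an ensemble with the randomForest package used later in this study; the data frame train and its factor response class are assumed placeholders for the study's training data. The out-of-bag votes it exposes are the raw material for the margin and diversity measures introduced below.

```r
# Minimal sketch (assumed data): fit a random forest and inspect its
# out-of-bag (OOB) votes with the randomForest package.
library(randomForest)

set.seed(42)
rf <- randomForest(class ~ ., data = train,
                   ntree = 150)  # ensemble of 150 trees, as used in this study

head(rf$votes)  # OOB vote fractions: one row per training sample, one column per class
```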
Studies have reported that the number of variables randomly sampled to split training data at decision tree nodes does not affect classification rates (and other RF performance measures) (Cutler et al., 2007).

3. Ensemble margin

The margin provides a measure of confidence in ensemble classification (Guo et al., 2011; Mellor et al., 2014, 2015) and is an
important concept in ensemble methods (Schapire and Freund, 1998). The ensemble margin is calculated as the difference between the number of votes assigned to different classes by the base classifiers in an ensemble. The unsupervised version of Schapire's margin (Eq. (1)) of a sample x is the difference between the number of votes (respectively $V_{c_1}$ and $V_{c_2}$) assigned to the first and second most popular classes (respectively $c_1$ and $c_2$), normalised by the number of base classifiers (T) in the ensemble, regardless of true class labels (Guo and Boukir, 2013). It has been used in large area remote sensing classification as an ancillary measure of random forest classifier performance (Mellor et al., 2014, 2015).
$\mathrm{margin}(x) = \dfrac{V_{c_1} - V_{c_2}}{T}, \qquad 0 \leq \mathrm{margin}(x) \leq 1 \qquad (1)$
Correctly classified training instances with high margin values (i.e. close to 1) represent instances located away from class decision boundaries and can contain a high degree of redundant information in a classification problem. Conversely, training instances with low margin values (i.e. close to 0) are located near decision boundaries and are more informative in a classification task. Unlike Schapire's margin (Schapire and Freund, 1998), which is supervised and calculated as the difference between the votes assigned to the true class and those assigned to the most voted class that is different from the true class, the unsupervised margin (Guo and Boukir, 2013) applied in this study does not require class labels. As such, the unsupervised margin may be more robust to noise (Guo, 2011). The mean margin (Eq. (2)) is a descriptive statistic for the ensemble margin, calculated from the unsupervised margin values (Eq. (1)), which can be used as a confidence measure for model performance (Mellor et al., 2014, 2015). This measure ranges from -1 (weakest ensemble classifier) to +1 (strongest ensemble classifier).
$\mu = \dfrac{n_c\,\mu_c - n_m\,\mu_m}{n_c + n_m}, \qquad -1 \leq \mu \leq 1 \qquad (2)$
where $n_c$ is the number of correctly classified instances, $n_m$ is the number of misclassified instances, and $\mu_c$ and $\mu_m$ are the mean margins of the correctly classified and misclassified instances respectively.
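As an illustration, Eqs. (1) and (2) can be computed in R from a matrix of per-class vote fractions such as the out-of-bag votes rf$votes from the sketch in Section 2; this is a hedged sketch with our own helper names, not the authors' implementation.

```r
# Eq. (1): unsupervised margin from an N x K matrix of vote fractions
# (rows sum to 1, as returned by randomForest with the default norm.votes = TRUE).
unsupervised_margin <- function(votes) {
  sorted <- t(apply(votes, 1, sort, decreasing = TRUE))
  sorted[, 1] - sorted[, 2]  # difference between the two most voted classes
}

# Eq. (2): mean margin; 'correct' is a logical vector flagging samples whose
# ensemble prediction matches the true label.
mean_margin <- function(margins, correct) {
  n_c <- sum(correct); n_m <- sum(!correct)
  mu_c <- if (n_c > 0) mean(margins[correct])  else 0
  mu_m <- if (n_m > 0) mean(margins[!correct]) else 0
  (n_c * mu_c - n_m * mu_m) / (n_c + n_m)
}
```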
4. Ensemble diversity

Ensemble diversity is important for majority vote accuracy and aims at decreasing the probability of identical errors (correlation between ensemble members). While it is accepted that diversity improves overall ensemble classification performance, there is no general agreement on how it should be quantified or dealt with (Kapp et al., 2007), nor is there a widely accepted concept of diversity or theoretical framework which supports the development of methods to capture diversity among classifiers (Bi, 2012). A review by Kuncheva and Whitaker (2003) compared ten measures of pairwise and non-pairwise diversity, finding most to be highly correlated. In pairwise measures, the diversity values between all pairs of classifiers are first calculated, and the overall diversity is then computed as the mean of all pairwise values. Non-pairwise measures, in contrast, are computed from a statistic over the outputs of the whole ensemble, and therefore generally run much faster than pairwise measures (Guo, 2011). Diversity can be measured at the output (prediction) level, the input (training data) level and at the structure or parameter level (Guo and Boukir, 2014). In this study, we measure diversity at the output level (i.e. diversity among the class labels assigned across each of the base classifiers in the ensemble), using KW (Kohavi and Wolpert) variance (Kohavi and Wolpert, 1996), a
popular non-pairwise diversity measure, which can be expressed as Eq. (3) (Kapp et al., 2007).
$KW = \dfrac{1}{N T^{2}} \sum_{j=1}^{N} t(x_j)\,\big(T - t(x_j)\big), \qquad 0 \leq KW \leq 0.25 \qquad (3)$

where diversity increases with KW variance, $T$ is the size of the ensemble of classifiers, $t(x_j)$ is the number of classifiers that correctly recognise sample $x_j$, and $N$ is the number of samples.
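As a concrete illustration (a sketch, not the authors' code), KW variance can be computed from the per-tree class predictions that randomForest returns with predict.all = TRUE; the helper name kw_diversity is our own.

```r
# Eq. (3): KW variance from an N x T matrix of per-tree predictions
# (e.g. predict(rf, test, predict.all = TRUE)$individual) and the vector
# of true labels. Returns a value in [0, 0.25].
kw_diversity <- function(preds, truth) {
  T_trees <- ncol(preds)
  correct <- preds == as.character(truth)  # truth recycles down each column
  t_xj    <- rowSums(correct)              # trees correct on each sample
  sum(t_xj * (T_trees - t_xj)) / (nrow(preds) * T_trees^2)
}
```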
The minimum value for KW diversity is 0 (lowest diversity), which occurs when all the T ensemble members correctly classify all of the samples (overall accuracy of 100% and mean margin $\mu$ of 1), or conversely, when all of the T ensemble members misclassify all of the samples (overall accuracy of 0% and negative mean margins $\mu$ ranging from -1, in binary classification, to 0). KW diversity is maximised (KW = 0.25) when half of the T ensemble members correctly classify each of the samples. The mean margin $\mu$ then ranges from 0 (in the case of binary classification) to 0.5. In this case, the underlying events are equiprobable, i.e. the probability of an instance being correctly classified and the probability of it being misclassified are the same, as in random prediction.

A good diversity measure would have the ability to find the extent of diversity among classifiers and estimate the improvement or deterioration in accuracy of individual classifiers when they are combined (Bi, 2012). An optimal ensemble classifier achieves the right balance between the accuracy of base classifiers and the diversity of the ensemble. Over-fitting can occur if diversity is too low and there is too much correlation between base classifiers. Too much diversity, however, can reduce the accuracy of the ensemble: for example, an ensemble classifier with random prediction has the highest diversity but the lowest accuracy. This accuracy-diversity trade-off will be investigated in this study. An emphasis is placed on analysing the relationship between diversity and ensemble margin, which play a key role in majority vote performance.

5. Study area and data

The study area for the experiments covers about seven million hectares of diverse dry-sclerophyll dominated public forests in Victoria, Australia. This area is characterised by varied topography and a range of climate zones. Classification predictor variables include remote sensing data (Landsat TM and MODIS), derived texture indices, elevation, slope, aspect and biophysical climate data.

Landsat TM data – frequently applied in studies for forest type mapping and canopy cover assessment (e.g. Boyd and Danson, 2005) – comprises a mosaic of nineteen scenes, captured between February and March 2009, coinciding with the time of training and test data land cover mapping. High sun angles during the summer period of Landsat data acquisition minimised shadow and terrain artifacts in the imagery, and enhanced spectral differences between overstorey evergreen vegetation and more seasonally dynamic understorey vegetation (Mellor et al., 2013). Landsat TM scenes were processed to standardised surface reflectance (Flood et al., 2013), reducing inter-scene variation due to atmospheric conditions, topography, sun angle and sensor location. A single standard deviation raster surface was extracted from an annual twenty-three image multi-temporal stack of 16-day MODIS NDVI mosaics (Paget and King, 2008); this was used to represent phenological variance over a calendar year across the study area.

To characterise vegetation regions which can appear spectrally similar but have different spatial patterns, textural indices were included as variables in the model. Texture indices have been shown to improve classification performance (Kayitakire et al., 2006; Rodríguez-Galiano et al., 2011).
First order texture measures of variance and entropy (Haralick, 1979) were generated for 3 × 3
and 5 × 5 cell neighbourhood moving windows, from a grey-scaled (8-bit) Landsat TM derived Normalised Difference Vegetation Index (NDVI). Textural indices were designed to capture the textural variance of the study area's forested ecosystems (Mellor et al., 2013).

Topographic and biophysical data were used in the classifier to capture species-environment relationships, which are key information for geographical modelling (Guisan and Zimmermann, 2000). Vegetation composition is expected to occur in locations with similar soils, topography and climate (Franklin, 1995), and bioclimatic maps provide information about the climatic influence on the distribution of different forest types (Beaumont et al., 2005). Elevation, slope and aspect data were derived from a 30 m Digital Elevation Model (DEM) (CSIRO, 2011). The DEM was also used to generate precipitation, temperature, radiation and moisture climate prediction surfaces using BIOCLIM in the ANUCLIM (v 5.1) software package (Houlder, 2001); a description of the BIOCLIM process can be found in Beaumont et al. (2005).

Classification reference (training and test) data were derived from seven hundred and sixty-six 2 × 2 km digital aerial photograph interpreted (API) land cover maps, systematically distributed across a state-wide random stratified grid (Fig. 1), from imagery acquired between 2006 and 2010. Trained interpreters delineated land cover classes based on information which included crown shape, colour, shadow and size. A land cover classification system was applied based on Mellor and Haywood (2010), which included broad forest or other land cover types, three forest canopy height classes (low, medium and tall) and three canopy cover classes (woodland, open and closed). The forest definition applied followed the Australian National Forest Inventory (Department of Agriculture, Fisheries and Forestry, 2012), whereby forest is defined as having greater than 20% crown cover and a minimum stand height of two metres. A half hectare minimum mapping unit was also applied to land cover maps, following the UNFAO forest definition (Food and Agriculture Organization of the United Nations, 2001). A detailed description of the land cover reference data methodology can be found in Farmer et al. (2013).

For this study, land cover data were aggregated into three broad canopy cover classes (woodland, open, closed) and two non-forest classes (shrub and non-forest). Examples of canopy cover classes in aerial photography are shown in Fig. 2. Land cover polygons were converted to raster and combined with the classification predictor variables. Following Mellor et al. (2015), reference data were divided into training and test subsets, comprising 100,000 (20,000 per class) and 25,000 (5000 per class) samples respectively.
6. Experiments

Three experiments were performed using the RF algorithm and assessed using measures of overall and per-class accuracies, Kappa coefficient, ensemble margin and KW diversity. The experiments were designed to explore the influence of, and relationship between, ensemble diversity and classification performance. The main originality of this empirical analysis lies in how the ensemble margin is explicitly involved in the learning process, to induce greater diversity in the ensemble and influence its performance. The randomForest package (Liaw and Wiener, 2002) in R (R Development Core Team, 2011) was used to build the RF models and run the experiments. Following our previous work (Mellor et al., 2015), 150 base classifiers (decision trees) were used in each experiment. Training data were used to calculate unsupervised margin values and then mean margin. Test data were used to calculate RF model overall and per-class accuracies, Kappa statistic and KW diversity.
Fig. 1. Study area map: Victorian Interim Biogeographic Regionalisation for Australia (IBRA Bioregions) and Aerial Photographic Interpretation (API) land cover maps.
Overall accuracy was first calculated for each individual ensemble base classifier before being combined to calculate ensemble accuracy, ensemble margin and KW diversity for the ensemble. To more clearly illustrate results, all diversity values were normalised to range from 0 to 1. Calculated Kappa coefficients (Carletta, 1996) also range from 0 to 1.

6.1. Experiment 1: Influence of the number of predictor variables on diversity and margin

The number of variables randomly sampled as candidates to partition training data at each decision tree node (hereafter referred to as mtry, from the randomForest R package) was adjusted to evaluate the parameter's effect on classification performance and diversity. For this experiment, starting with two, mtry was increased (in single increments) for each RF ensemble model, up to 17 (the maximum number of predictor variables available). Classification accuracy, Kappa statistic, mean margin and KW diversity were calculated for each ensemble.
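A minimal sketch of this parameter sweep, under the assumptions of the earlier snippets (train/test data frames with factor response class, and the kw_diversity helper), might look as follows; the actual experiment also recorded mean margin and per-tree accuracies, and rescaled KW diversity to [0, 1] for presentation.

```r
# Experiment 1 sketch: vary mtry and record ensemble accuracy and raw KW diversity.
results <- do.call(rbind, lapply(2:17, function(m) {
  rf_m <- randomForest(class ~ ., data = train, ntree = 150, mtry = m)
  pred <- predict(rf_m, test, predict.all = TRUE)
  data.frame(mtry      = m,
             accuracy  = mean(pred$aggregate == test$class),
             diversity = kw_diversity(pred$individual, test$class))
}))
results
```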
6.2. Experiment 2: Training margins and high diversity data selection

The second experiment constitutes the major contribution of this exploration of ensemble diversity, investigating a new means of inducing diversity in ensemble learning. It consists of emphasizing the role of lower margin samples in the learning process at the expense of the highest margin samples, the latter having the least influence on diversity and ensemble classification performance. For this experiment (Fig. 3), the unsupervised margin (Eq. (1)) was first calculated for each training data instance as the difference between the maximum number of decision tree (ensemble member) votes assigned to a class and the number of votes assigned to the second most voted class. Percentile distributions were then calculated from the unsupervised margin values of the training set. RF classifications were run on sub-sets of the original training set, using only training instances in the bottom (lowest margins) and top (highest margins) 50th, 60th, 70th, 80th and 90th percentiles to build RF models, as well as all training instances. For each ensemble, the mean of the individual ensemble members' overall and per-class accuracy and Kappa statistic, the ensemble overall and per-class accuracy and Kappa statistic, and KW diversity were calculated. These results were compared to ensemble classifiers generated using random subsets (50%, 60%, 70%, 80% and 90%) of all available training instances.
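The margin-based selection could be sketched as below, again using the assumed helpers from earlier sections (the paper's exact pipeline is summarised in Fig. 3):

```r
# Experiment 2 sketch: rebuild the forest from the lowest-margin percentiles
# of the training set, using OOB vote fractions from an initial model 'rf'.
margins <- unsupervised_margin(rf$votes)  # Eq. (1) on OOB votes

for (p in c(0.5, 0.6, 0.7, 0.8, 0.9)) {
  low   <- train[margins <= quantile(margins, p), ]  # bottom p-th margin percentile
  rf_p  <- randomForest(class ~ ., data = low, ntree = 150)
  acc_p <- mean(predict(rf_p, test) == test$class)
  cat(sprintf("bottom %.0f%%: ensemble accuracy %.1f%%\n", 100 * p, 100 * acc_p))
}
```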
6.3. Experiment 3: Influence of the minimum node size on diversity

The last original empirical analysis aims to investigate the influence of tree pruning (and therefore decision tree depth) on diversity, for a better understanding of ensemble performance in general, and RF performance in particular. The minimum node size is a model parameter used to control the minimum size of terminal nodes in each decision tree, and therefore the depth of the decision trees. By default in the randomForest package (and in the other experiments in this study), the minimum node size is set to 1. In this experiment (Fig. 4), the minimum node size was increased for each RF ensemble model (from 1 up to 250), and ensemble and mean base classifier accuracies, Kappa statistics and diversity were calculated for each.
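A sketch of this sweep under the same assumptions (nodesize is the randomForest argument controlling minimum terminal node size):

```r
# Experiment 3 sketch: shallower trees via larger minimum terminal node size.
for (ns in c(1, 7, 15, 30, 50, 100, 250)) {
  rf_ns <- randomForest(class ~ ., data = train, ntree = 150, nodesize = ns)
  pred  <- predict(rf_ns, test, predict.all = TRUE)
  cat(sprintf("nodesize %3.0f: accuracy %.1f%%, KW diversity %.3f\n", ns,
              100 * mean(pred$aggregate == test$class),
              kw_diversity(pred$individual, test$class)))
}
```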
Fig. 2. Aerial photography examples of forest canopy cover used in the multiclass classification: a) Woodland, 20–50% canopy cover; b) Open, 51–80% canopy cover; c) Closed, >80% canopy cover; d) Shrub (land cover dominated by woody vegetation shrub species, up to 2 m in height).
7. Results and discussion

7.1. Influence of the number of predictor variables on diversity and margin

Fig. 5 and Table 1 show the results of experiment 1. These results show that diversity decreases as the number of predictor variables selected for decision tree splitting (mtry) increases. Indeed, the fewer the variables assessed for node splitting, the greater the amount of introduced uncertainty (as shown by the mean margin) and the higher the diversity achieved. Increasing the number of predictor variables assessed at each node split increases classification confidence (Guo and Boukir, 2014). The ensemble and mean individual decision tree classification accuracies increase marginally with increasing mtry. Above an mtry value of 5, overall ensemble and mean base classifier accuracies are stable (83.0%, 0.79 Kappa, and 70.2%, 0.63 Kappa respectively).
Note that a standard RF model would use 4 node split variables ($mtry = \sqrt{17} \approx 4$), which, applied here, does not result in the highest overall ensemble classification accuracy. Overall classification accuracy and Kappa coefficient by mtry are shown in Table 1. While the mean single tree accuracy is reduced with fewer variables (and uncertainty is higher), the difference between overall (ensemble) and single tree accuracies is greater for 2 variables than for the maximum 17 variables (15.5% and 12% respectively). This illustrates how the loss in tree accuracy, and the uncertainty, associated with a low number of variables is compensated for by higher diversity, which significantly influences classification performance.

7.2. Training margins and high diversity data selection

Figs. 6–8 show the results of experiment 2: the mean base classifier accuracy, ensemble accuracy and normalised KW diversity as a function of training set size, selected by training instances in the bottom (lowest margins) and top (highest margins) 50th to 90th percentiles, and randomly selected training instances (equivalent proportions of the total training set).
Fig. 3. Flow chart illustrating training margins experiment (2).
Fig. 4. Flow chart illustrating minimum node size experiment (3).
The x-axis in Figs. 6–8 ranges from 50 to 100, and represents the margin percentile (in the case of margin-based training data selection) and the proportion of the training set size (in the case of random training data selection). For example, the bottom 50th margin percentile training data sub-set is the same size as the randomly sampled 50% training set. Table 2 shows mean tree and ensemble accuracies (%) and Kappa results for the training margin experiments.
Lower margin models (using training samples with margin values in the bottom 50th, 60th, 70th, 80th and 90th percentiles) result in lower mean decision tree accuracies compared to higher margin models (using training samples with margin values in the top 50th to 90th percentiles) (Fig. 6). This is especially true when comparing the top and bottom training instance margin models in the 50th to 70th percentile range.
Fig. 5. Ensemble and mean base classifier accuracies, mean margin and KW diversity plotted against mtry.
Table 1. Mean tree and ensemble accuracies (%) and Kappa statistic results for the number of predictor variables experiment.

mtry   Mean tree accuracy (%)   Mean tree kappa   Ensemble accuracy (%)   Ensemble kappa
2      65.48                    0.57              80.95                   0.76
3      67.11                    0.59              81.82                   0.77
4      68.13                    0.60              82.21                   0.78
5      68.65                    0.61              82.43                   0.78
6      69.21                    0.61              82.93                   0.78
7      69.39                    0.62              82.88                   0.78
8      69.74                    0.62              83.00                   0.79
9      69.87                    0.62              83.11                   0.79
10     70.11                    0.63              83.14                   0.79
11     70.13                    0.63              83.21                   0.79
12     70.33                    0.63              83.10                   0.79
13     70.34                    0.63              82.94                   0.79
14     70.48                    0.63              82.92                   0.79
15     70.58                    0.63              82.89                   0.79
16     70.65                    0.63              82.70                   0.79
17     70.63                    0.63              82.84                   0.79
Fig. 6. Mean tree accuracy as a function of training set size by lowest and highest unsupervised margins, and random sampling.
Fig. 7. Ensemble accuracy as a function of training set size by lowest and highest unsupervised margins, and random sampling.
Highest margin generated models (50th to 90th percentiles) exhibit the highest mean tree accuracy (Fig. 6), but apart from the 50th margin percentile case, return the poorest ensemble accuracies compared to equivalent training set size models from bottom margin percentiles and random sampling (Fig. 7). It is worth highlighting that for the 70th lowest margin percentile, the overall
accuracy achieved is the same as that of the entire training set. Hence, the 30% highest margin samples that have been discarded from the training set are redundant. Redundancy not only slows down the training task, it also weakens bagging performance, affecting the rarer and most difficult classes. The lowest margin training sample selection approach minimises data redundancy.
Fig. 8. Ensemble KW diversity as a function of training set size by lowest and highest unsupervised margins, and random sampling.
Models generated from training instances in the bottom 70th, 80th and 90th margin percentiles achieve the best ensemble accuracy (Fig. 7). Fig. 8 shows that low margin sampling models also exhibit the highest diversity (close to maximum diversity for the 50th lowest percentile) compared to random and highest margin sampling models. The diversity of the lowest margin and random sampling models converges at the 90th lowest percentile and 90% training set size models. The strength of the RF ensemble bagging approach to induce diversity is underscored by the relative stability of the mean tree (Fig. 6) and ensemble accuracy curves (Fig. 7) for random sampling models by training set size, even when only half of the training data are used, particularly in comparison to the low and high margin sampling cases. Indeed, bootstrap sampling (Efron and Tibshirani, 1994) is a robust and effective approach that is suitable for small datasets. These results, comparing two opposite margin sampling strategies, show that targeting lower margin training data (which represent samples closer to class boundaries and/or more difficult than higher margin samples) is a means of inducing further diversity among decision trees in an ensemble classifier. The low margin sampling selection strategy (targeting more class decision boundary, difficult and rare class examples), while decreasing mean tree accuracy, demonstrates improved ensemble performance induced by the underlying increase in ensemble diversity.
The effect of low margin sampling is even more pronounced when looking at ensemble accuracy results for only the open canopy class (the most challenging class to classify) (Fig. 9). Unsurprisingly, this class returns its highest accuracy (74%) in the bottom 50th percentile margins model and its lowest accuracy (53%) in the top 50th percentile margins model. Furthermore, there is a greater than 5% increase in accuracy between lowest margin and random sampling for a 50% training set size. Indeed, open canopy has the highest proportion of low margin samples (Fig. 11). Consequently, as a hard or rare class, it is favoured by an approach which promotes the selection of lower margin training data. This strategy reduces data redundancy and increases information significance (e.g. class decision boundary instances are more informative), and therefore yields stronger classifiers with an increased capability for handling hard or rare classes. Classes which are more challenging to predict, such as the open canopy class, may be more commonly misidentified (as woodland or shrub, for example) than more easily distinguishable forest canopy classes (e.g. the closed canopy class, which has the lowest proportion of low margin samples among the forest canopy classes; Fig. 11).

Reducing the dominance of highest margin instances in the training dataset may be a strategy to increase ensemble diversity, whereby bagging samples used to construct each decision tree are themselves more diverse, through the inclusion of more instances close to class decision boundaries and more hard class examples. However, an important reduction in the proportion of higher margin instances in the training set would affect the ensemble classifier performance on easier classes, such as closed canopy, whose loss in accuracy is about 10% in the bottom 50th percentile margin model (Fig. 10), while this model allows the hardest class (open canopy) to achieve its highest accuracy. This poor ensemble per-class accuracy is associated with relative training data imbalance for the pair closed/open canopies of about 40–60% (Fig. 11) – an increase of 10% for the hardest class and a decrease of 10% for the easiest class compared to the balanced case – as well as a reduction in training set size of half of the original set. This result is consistent with the pairwise (open/closed canopies) class imbalance experiment results, involving random sampling, reported in our previous work (Mellor et al., 2015). A trade-off in the proportion of low and high margin training samples will benefit harder classes while maintaining, or even improving, the classification performance of easier classes. As Fig. 10 shows, from the 60th lowest margin percentile, the ensemble accuracy is increased slightly for the closed canopy class, compared to using all of the training data.
Fig. 9. Ensemble accuracy for the open canopy class as a function of training set size by lowest and highest unsupervised margins, and random sampling.
Table 2. Mean tree and ensemble accuracies (%) and Kappa statistic results for the training margin experiments.

Margin percentile     Mean tree accuracy (%)   Mean tree kappa   Ensemble accuracy (%)   Ensemble kappa
Bottom 50th           45.71                    0.33              66.87                   0.60
Bottom 60th           54.41                    0.44              78.96                   0.73
Bottom 70th           60.80                    0.52              81.94                   0.78
Bottom 80th           66.64                    0.58              82.26                   0.78
Bottom 90th           68.08                    0.60              82.15                   0.78
Top 50th              68.37                    0.60              72.76                   0.66
Top 60th              69.98                    0.62              75.01                   0.68
Top 70th              71.04                    0.64              76.78                   0.71
Top 80th              71.41                    0.64              78.72                   0.73
Top 90th              70.92                    0.63              80.59                   0.76
Random sampling 50%   64.93                    0.56              79.36                   0.74
Random sampling 60%   65.76                    0.57              80.24                   0.75
Random sampling 70%   66.45                    0.58              80.85                   0.76
Random sampling 80%   67.08                    0.59              81.32                   0.77
Random sampling 90%   67.45                    0.60              81.54                   0.77
Random sampling 100%  68.10                    0.60              82.06                   0.78
Fig. 10. Ensemble accuracy for the closed canopy class as a function of training set size by lowest and highest unsupervised margins, and random sampling.
Fig. 11. Proportion of training samples by class and lowest unsupervised margins by percentile.

7.3. Influence of the minimum node size on diversity
Results of the minimum node size experiment (Fig. 12 and Table 3) reveal ensemble accuracy to be highest where decision trees are grown to their greatest depth (minimum node size of 1), as in RF ensembles, which use unpruned trees. Decreasing ensemble diversity is associated with increasing minimum node size (shallower decision trees) and lower ensemble accuracy. Mean tree accuracy is relatively stable for minimum node sizes under 50; hence, the loss in ensemble accuracy in this range is mainly due to the loss in diversity. A minimum node size over 50 also affects mean tree accuracy and therefore induces a steeper drop in ensemble accuracy. Indeed, the generalisation error can be attributed to the combination of the precision of base classifiers and the relative diversity between them (Kapp et al., 2007). While these results demonstrate the relationship between diversity across decision trees and ensemble accuracy, deeper trees mean more complex decision rules, which can result in overfitting, particularly if trees are permitted to split down to a single observation.

Fig. 12. Ensemble and mean base classifier accuracies and KW diversity as a function of minimum node size.
Table 3. Mean tree and ensemble accuracies (%) and Kappa statistic results for the minimum node size experiment.

Minimum node size   Mean tree accuracy (%)   Mean tree kappa   Ensemble accuracy (%)   Ensemble kappa
1                   68.08                    0.62              82.18                   0.78
7                   67.83                    0.63              81.64                   0.77
15                  68.08                    0.62              80.60                   0.76
30                  68.50                    0.62              78.90                   0.73
50                  68.37                    0.62              76.99                   0.71
100                 67.17                    0.62              74.06                   0.68
250                 65.14                    0.62              71.06                   0.64
8. Conclusion

The results of these experiments provide insights into the relationship between ensemble diversity and classification performance, in a large area classification problem context using the random forest ensemble classifier. Investigating the effect of the number of decision tree splitting variables on classification performance showed how lower single tree classification performance (both accuracy and uncertainty) associated with fewer splitting variables is compensated for by higher ensemble diversity, significantly influencing ensemble classification performance.

Targeting lower margin training samples (which represent class decision boundaries or more difficult or rarer classes) is a way to increase uncertainty and consequently induce diversity in ensemble learning – a strategy which reduces data redundancy and increases the significance of training information. In the context of large area remote sensing classification, where reference data can be expensive and time-consuming to collect, the margin-based selection of training samples is a way to optimise ensemble classification design, boost efficiency and reduce reference data resource and processing costs.

Exploring the influence of tree pruning (through the variation of minimum node size) on classification performance demonstrated that unpruned decision trees (minimum node size of 1) achieve both the highest single tree classification accuracy and the highest diversity among ensemble members, two ingredients for optimal ensemble classification performance. This result partly explains the superiority of random forests, which use unpruned trees, over other tree-based ensembles, such as boosting and bagging, which involve tree pruning.

The findings of this study may inform the design of training data collection strategies and ensemble classification design and parameterisation. Future research will investigate the combined use of ensemble diversity and ensemble margin, two key concepts in ensemble learning, to guide RF training data selection for improved learning and better large area land cover mapping performance.
References

Baccini, A., Laporte, N.T., Goetz, S.J., Sun, M., Dong, H., 2008. A first map of tropical Africa's above-ground biomass derived from satellite imagery. Environ. Res. Lett. 3, 1–9.
Banfield, R.E., Hall, L.O., Bowyer, K.W., Kegelmeyer, W.P., 2005. Ensemble diversity measures and their application to thinning. Inf. Fusion 6, 49–62. http://dx.doi.org/10.1016/j.inffus.2004.04.005.
Beaumont, L., Hughes, L., Poulsen, M., 2005. Predicting species distributions: use of climatic parameters in BIOCLIM and its impact on predictions of species' current and future distributions. Ecol. Modell. 186, 250–269.
Belgiu, M., Drăguț, L., 2016. Random forest in remote sensing: a review of applications and future directions. ISPRS J. Photogramm. Remote Sens. 114, 24–31. http://dx.doi.org/10.1016/j.isprsjprs.2016.01.011.
Bi, Y., 2012. The impact of diversity on the accuracy of evidential classifier ensembles. Int. J. Approx. Reason. 53, 584–607.
Boyd, D.S., Danson, F.M., 2005. Satellite remote sensing of forest resources: three decades of research development. Prog. Phys. Geogr. 29, 1–26. http://dx.doi.org/10.1191/0309133305pp432ra.
Breiman, L., 2001. Random forests. Mach. Learn. 45, 5–32. http://dx.doi.org/10.1023/A:1010933404324.
Carletta, J., 1996. Assessing agreement on classification tasks: the kappa statistic. Comput. Linguist. 22, 249–254.
CSIRO, 2011. One-second SRTM digital elevation model. URL http://www.csiro.au/Outcomes/Water/Water-information-systems/One-second-SRTM-Digital-Elevation-Model.aspx.
Cutler, D.R., Edwards Jr., T.C., Beard, K.H., 2007. Random forests for classification in ecology. Ecology 88, 2783–2792.
Dalponte, M., Orka, H.O., Gobakken, T., Gianelle, D., Naesset, E., 2013. Tree species classification in boreal forests with hyperspectral data. IEEE Trans. Geosci. Remote Sens. 51, 2632–2645. http://dx.doi.org/10.1109/TGRS.2012.2216272.
Department of Agriculture, Fisheries and Forestry, 2012. Australia's forests at a glance. Canberra.
Díez-Pastor, J.F., Rodríguez, J.J., García-Osorio, C.I., Kuncheva, L.I., 2015. Diversity techniques improve the performance of the best imbalance learning ensembles. Inf. Sci. 325, 98–117. http://dx.doi.org/10.1016/j.ins.2015.07.025.
Efron, B., Tibshirani, R.J., 1994. An Introduction to the Bootstrap. Chapman & Hall/CRC.
Elghazel, H., Aussem, A., Perraud, F., 2011. Trading-off diversity and accuracy for optimal ensemble tree selection in random forests. In: Okun, O., Valentini, G., Re, M. (Eds.), Ensembles in Machine Learning Applications, Studies in Computational Intelligence. Springer, Berlin, Heidelberg, pp. 169–179. http://dx.doi.org/10.1007/978-3-642-22910-7.
Evans, J.S., Cushman, S., 2009. Gradient modeling of conifer species using random forests. Landsc. Ecol. 24, 673–683.
Farmer, E., Jones, S., Clarke, C., Buxton, L., Soto-Berelov, M., Page, S., Mellor, A., Haywood, A., 2013. Creating a large area landcover dataset for public land monitoring and reporting. In: Arrowsmith, C., Bellman, C., Cartwright, W., Jones, S., Shortis, M. (Eds.), Progress in Geospatial Science Research. Publishing Solutions, Melbourne, pp. 85–98.
Flood, N., Danaher, T., Gill, T., Gillingham, S., 2013. An operational scheme for deriving standardised surface reflectance from Landsat TM/ETM+ and SPOT HRG imagery for eastern Australia. Remote Sens. 5, 83–109. http://dx.doi.org/10.3390/rs5010083.
Food and Agriculture Organization of the United Nations, 2001. Global forest resources assessment 2000.
Foody, G., Pal, M., Rocchini, D., Garzon-Lopez, C., Bastin, L., 2016. The sensitivity of mapping methods to reference data quality: training supervised image classifications with imperfect reference data. ISPRS Int. J. Geo-Inf. 5, 199. http://dx.doi.org/10.3390/ijgi5110199.
Franklin, J., 1995. Predictive vegetation mapping: geographic modelling of biospatial patterns in relation to environmental gradients. Prog. Phys. Geogr. 19, 474–499.
Guisan, A., Zimmermann, N.E., 2000. Predictive habitat distribution models in ecology. Ecol. Modell. 135, 147–186. http://dx.doi.org/10.1016/S0304-3800(00)00354-9.
Guo, L., 2011. Margin framework for ensemble classifiers. Application to remote sensing data. University of Bordeaux.
Guo, L., Boukir, S., 2013. Margin-based ordered aggregation for ensemble pruning. Pattern Recognit. Lett. 34, 603–609. http://dx.doi.org/10.1016/j.patrec.2013.01.003.
Guo, L., Boukir, S., 2014. Ensemble margin framework for image classification. In: ICIP 2014, IEEE Int. Conf. Image Proc., pp. 4231–4235.
Guo, L., Chehata, N., Mallet, C., Boukir, S., 2011. Relevance of airborne lidar and multispectral image data for urban scene classification using Random Forests. ISPRS J. Photogramm. Remote Sens. 66, 56–66. http://dx.doi.org/10.1016/j.isprsjprs.2010.08.007.
Ham, J., Chen, Y., Crawford, M.M., Ghosh, J., 2005. Investigation of the random forest framework for classification of hyperspectral data. IEEE Trans. Geosci. Remote Sens. 43, 492–501.
Hansen, L., Salamon, P., 1990. Neural network ensembles. IEEE Trans. Pattern Anal. Mach. Intell. 12, 993–1001.
Haralick, R.M., 1979. Statistical and structural approaches to texture. Proc. IEEE 67, 786–804.
Houlder, D., 2001. ANUCLIM (version 5.1).
Kapp, M.N., Sabourin, R., Maupin, P., 2007. An empirical study on diversity measures and margin theory for ensembles of classifiers. In: 10th Int. Conf. Information Fusion. IEEE, pp. 1–8. http://dx.doi.org/10.1109/ICIF.2007.4408144.
Kayitakire, F., Hamel, C., Defourny, P., 2006. Retrieving forest structure variables based on image texture analysis and IKONOS-2 imagery. Remote Sens. Environ. 102, 390–401. http://dx.doi.org/10.1016/j.rse.2006.02.022.
Kohavi, R., Wolpert, D., 1996. Bias plus variance decomposition for zero-one loss functions. In: 13th Int. Conf. Machine Learning, ICML '96, pp. 275–283.
Kuncheva, L., Whitaker, C., 2003. Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy. Mach. Learn. 51, 181–207.
Liaw, A., Wiener, M., 2002. Classification and regression by randomForest. R News 2, 18–22.
Mellor, A., Boukir, S., Haywood, A., Jones, S., 2014. Using ensemble margin to explore issues of training data imbalance and mislabeling on large area land cover classification. In: 2014 IEEE Int. Conf. Image Proc. (ICIP). IEEE, pp. 5067–5071. http://dx.doi.org/10.1109/ICIP.2014.7026026.
Mellor, A., Boukir, S., Haywood, A., Jones, S., 2015. Exploring issues of training data imbalance and mislabelling on random forest performance for large area land cover classification using the ensemble margin. ISPRS J. Photogramm. Remote Sens. 105, 155–168. http://dx.doi.org/10.1016/j.isprsjprs.2015.03.014.
Mellor, A., Haywood, A., 2010. Remote sensing Victoria's public land forests – a two tiered synoptic approach. In: Proc. 15th Australasian Remote Sens. Photogramm. Conf., Alice Springs.
Mellor, A., Haywood, A., Stone, C., Jones, S., 2013. The performance of random forests in an operational setting for large area sclerophyll forest classification. Remote Sens. 5, 2838–2856. http://dx.doi.org/10.3390/rs5062838.
Melville, P., Mooney, R.J., 2005. Creating diversity in ensembles using artificial data. Inf. Fusion 6, 99–111. http://dx.doi.org/10.1016/j.inffus.2004.04.001.
Opitz, D., Maclin, R., 1999. Popular ensemble methods: an empirical study. J. Artif. Intell. Res. 11, 169–198.
Paget, M.J., King, E.A., 2008. MODIS land data sets for the Australian region. Canberra.
Pflugmacher, D., Cohen, W.B., Kennedy, R.E., 2012. Using Landsat-derived disturbance history (1972–2010) to predict current forest structure. Remote Sens. Environ. 122, 146–165. http://dx.doi.org/10.1016/j.rse.2011.09.025.
Polikar, R., 2006. Ensemble based systems in decision making. IEEE Circuits Syst. Mag. 6, 21–45. http://dx.doi.org/10.1109/MCAS.2006.1688199.
R Development Core Team, 2011. R: A Language and Environment for Statistical Computing.
Rodríguez-Galiano, V.F., Abarca-Hernández, F., Ghimire, B., Chica-Olmo, M., Atkinson, P.M., Jeganathan, C., 2011. Incorporating spatial variability measures in land-cover classification using random forest. Procedia Environ. Sci. 3, 44–49.
Rodriguez-Galiano, V.F., Chica-Olmo, M., Abarca-Hernandez, F., Atkinson, P.M., Jeganathan, C., 2012. Random forest classification of Mediterranean land cover using multi-seasonal imagery and multi-seasonal texture. Remote Sens. Environ. 121, 93–107. http://dx.doi.org/10.1016/j.rse.2011.12.003.
Schapire, R., Freund, Y., 1998. Boosting the margin: a new explanation for the effectiveness of voting methods. Ann. Stat. 26, 1651–1686. http://dx.doi.org/10.1214/aos/1024691352.
Tumer, K., Ghosh, J., 1996. Error correlation and error reduction in ensemble classifiers. Conn. Sci. 8, 385–404. http://dx.doi.org/10.1080/095400996116839.
Wilkes, P., Jones, S., Suarez, L., Mellor, A., 2015. Mapping forest canopy height across large areas by upscaling ALS estimates with freely available satellite data. Remote Sens. 7, 1–25. http://dx.doi.org/10.3390/rs70x000x.