ISPRS Journal of Photogrammetry and Remote Sensing 64 (2009) 450–457
Classifier ensembles for land cover mapping using multitemporal SAR imagery
Björn Waske a,∗, Matthias Braun b,1
a University of Iceland, Department of Electrical and Computer Engineering, Hajardarhagi 2-6, 107 Reykjavik, Iceland
b Center for Remote Sensing of Land Surfaces, University of Bonn, Walter-Flex-Street 3, 53113 Bonn, Germany
Article history: Received 31 October 2007; received in revised form 6 January 2009; accepted 14 January 2009; available online 23 February 2009
Keywords: Decision tree; Random forests; Boosting; Multitemporal SAR data; Land cover classification
Abstract
SAR data are almost independent of weather conditions and are thus well suited for mapping seasonally changing variables such as land cover. With regard to recent and upcoming missions, multitemporal and multifrequency approaches become even more attractive. In the present study, classifier ensembles (i.e., boosted decision trees and random forests) are applied to multitemporal C-band SAR data from different study sites and years. A detailed accuracy assessment shows that classifier ensembles, in particular random forests, outperform standard approaches such as a single decision tree and a conventional maximum likelihood classifier by more than 10%, independently of site and year. They reach an overall accuracy of almost 84% in rural areas with large plots. Visual interpretation confirms the statistical accuracy assessment and reveals that typical random noise is also considerably reduced. In addition, the results demonstrate that random forests are less sensitive to the number of training samples and perform well even with only a small number. Random forests are computationally highly efficient and are hence considered very well suited for land cover classification of future multifrequency and multitemporal stacks of SAR imagery. © 2009 International Society for Photogrammetry and Remote Sensing, Inc. (ISPRS). Published by Elsevier B.V. All rights reserved.
1. Introduction
Land cover mapping is one of the core application fields of remote sensing data. Agricultural areas are characterized by their typical spatial patterns, but also by their temporal dynamics and changing backscattering behaviour due to crop phenology and plant status. Hence, the utilization of monotemporal imagery is often inefficient, and multitemporal data are favoured in order to exploit these temporal effects for a better discrimination of surface types (e.g., Brisco and Brown (1995) and Blaes et al. (2005)). Many studies are based on multispectral remote sensing images, but the availability and efficiency of optical data sets are often limited by illumination and weather conditions, particularly in Central Europe. Since the backscatter intensity of synthetic aperture radar (SAR) data is almost independent of weather conditions, multitemporal data sets within one growing season can be produced most reliably using SAR imagery. Hence the data are particularly interesting for near-real-time applications and operational monitoring systems. Considering missions with short revisit times and better spatial resolutions like TerraSAR-X
∗ Corresponding author. Tel.: +354 5254670; fax: +354 5254632. E-mail address: [email protected] (B. Waske).
1 Tel.: +49 228 734975; fax: +49 228 736857.
(11 days; up to 1 m), SAR-based classifications become even more attractive. Although the classification and interpretation of SAR data often seem more difficult than those of multispectral imagery, several studies have demonstrated the positive impact of multitemporal SAR imagery on classification accuracy (Chakraborty et al., 1997; Panigrahy et al., 1999; Brisco and Brown, 1995; Chust et al., 2004; Tso and Mather, 1999). In Brisco and Brown (1995) the overall accuracy of a classification based on four SAR acquisitions was increased by up to 24% compared to results achieved with two images. Blaes et al. (2005) compared the performance of various multitemporal ERS data sets. The classification accuracies increased significantly with the number of images; depending on the number of scenes and the acquisition dates, the overall accuracies vary between 40% and 65%. Similar results were reported by Chust et al. (2004), who classified ERS data from a Mediterranean region. In addition, these results show that the variance of the overall accuracy decreases with an increasing number of images. Although some studies employ conventional methods such as the well-known maximum likelihood classifier (e.g., Brisco and Brown (1995) and Chust et al. (2004)), such widely used statistical classifiers are often not optimal for classifying high-dimensional image stacks such as those from multisource and multitemporal data sets. Multitemporal sets with high temporal and spatial resolution might become very large and complex. Furthermore
0924-2716/$ – see front matter © 2009 International Society for Photogrammetry and Remote Sensing, Inc. (ISPRS). Published by Elsevier B.V. All rights reserved. doi:10.1016/j.isprsjprs.2009.01.003
B. Waske, M. Braun / ISPRS Journal of Photogrammetry and Remote Sensing 64 (2009) 450–457
such imagery can contain unnecessary details and irrelevant information. In addition, the individual data sources may not be equally applicable: one acquisition might be better suited to describe a specific class than another scene of the time series. Thus it might be appropriate to weight the different images during the classification process, but common statistical techniques do not allow such weighting, or the weighting has to be chosen rather subjectively by the operator. In addition, in most cases the class distributions cannot be modelled by adequate multivariate statistical models (Benediktsson et al., 1990; Bruzzone et al., 2004). The Hughes phenomenon is another effect that reduces the performance of conventional statistical approaches such as the maximum likelihood classifier. Consequently, more sophisticated classification strategies and nonparametric algorithms are more promising.
Artificial neural networks (ANN) are a nonparametric method that has been used successfully for the classification of diverse remote sensing imagery. The overall accuracies are often significantly improved compared to conventional statistical classifiers: Benediktsson et al. (1990) used a backpropagation ANN for the classification of multisource data sets containing multispectral data and topographical information. In other experiments ANN were used for the classification of time series of SAR imagery (Chakraborty et al., 1997; Panigrahy et al., 1999; Stankiewicz, 2006). Besides the long training time, neural networks have no consistent rules for the network design, and their performance is affected by several factors, e.g. the network architecture (Foody and Arora, 1997), which is chosen by the operator.
Self-learning decision tree classifiers (DT) are another method that is applied to remote sensing imagery (Friedl and Brodley, 1997; Pal and Mather, 2003; Simard and Saatchi, 2000).
The handling of DTs is rather simple and their training time is relatively low compared to computationally complex approaches such as neural networks (Pal and Mather, 2003; Friedl and Brodley, 1997). Besides this, their visible classification scheme allows a direct interpretation of the decision with regard to the impact of individual features (i.e., a specific image band or image acquisition). In contrast to other classifier algorithms, which use the whole feature space at once and make a single membership decision, a decision tree is based on a hierarchical concept. A DT classifier is composed of an initial root node, several split nodes and the final leaf nodes. The root node contains the full training data set, which is split into two descendant nodes by applying the split rule. The two descendant nodes are split again. Finally, each leaf node carries the land cover class that is assigned to every pixel reaching this node. The most commonly used DT classifiers are binary trees. At each node the most relevant feature is selected and used for the construction of a binary test (i.e., decision boundary). Thus the node of a univariate binary classifier is based on only one feature, e.g., a single acquisition of an image time series, whereas multivariate DTs can use several features at each node. For a given unknown sample, the value of this specific feature is compared to the test and the sample is propagated to one of the two descendants.
The performance of DTs is increased by classifier ensembles or multiple classifier systems (Gislason et al., 2006; Briem et al., 2002; Carreiras et al., 2006; Pal, 2005). By training a classifier on resampled input data (i.e., features or training samples) a set of independent classifiers is generated. Afterwards the different outputs are combined to create the final result. Although the concept is not restricted to decision trees, DT classifiers are particularly interesting due to their simple handling and fast training time: Brown de Colstoun et al.
(2003) applied a decision tree on multitemporal images from the LANDSAT Enhanced Thematic Mapper-Plus (ETM+) sensor to differentiate 11 land cover types. The overall accuracy was significantly increased by classifier ensemble techniques, as well as by boosting. Carreiras
et al. (2006) carried out a classification of agricultural and pasture land within the Brazilian Amazon using a time series of SPOT 4 Vegetation data. In this study a DT-based classifier ensemble significantly outperformed all other approaches (i.e., maximum likelihood classifier, simple decision tree and k-nearest neighbour) in terms of overall accuracy. Briem et al. (2002) successfully used different classifier ensembles for classifying different multisource data sets, including SAR and multispectral imagery among other data types. Waske and Benediktsson (2007) applied a boosted decision tree to a multisensor data set consisting of multitemporal SAR data and multispectral imagery. The classifier ensemble performed better than conventional methods such as simple decision trees and the maximum likelihood classifier.
Breiman's (2001) random forests classifier system has been used in diverse remote sensing studies (Ham et al., 2005; Lawrence et al., 2006; Gislason et al., 2006; Pal, 2005). Ham et al. (2005) reported a good performance of random forests (RF) for classifying hyperspectral data with a limited sample set. In Gislason et al. (2006) the method was applied to a multisource data set consisting of Landsat MSS data and topographical data. The RF performed better than a single decision tree and was comparable to other ensemble methods, while requiring much less computation time. Pal (2005) used the approach for the classification of a Landsat ETM+ scene from an agricultural region. In this study, RF achieved promising results and the accuracies were comparable to computationally more complex methods like support vector machines. In Waske and van der Linden (2008) the classifier was combined with support vector machines in a sequential approach for classifying multisensor imagery from different segmentation scales.
The results also underline the generally good performance of the random forests concept.
Although classifier ensembles have given promising results, only a few applications are known for SAR data, in particular for multitemporal SAR data. In Waske et al. (2006) the concept of random feature selection is applied to a set of multitemporal SAR data. The results clearly demonstrate that the approach is more effective than a simple decision tree. In Waske and Benediktsson (2007) multisource data (SAR and multispectral) were classified using a boosted decision tree among other classifiers.
In this study, random forests are applied to three multitemporal C-band SAR data sets (ERS-2 and ENVISAT) from agricultural regions. The impact of training sample size on the classifier performance is investigated. The results are benchmarked against well-known algorithms such as a maximum likelihood classifier and a simple decision tree. A comparison with the results of these approaches is worthwhile, given the numerous applications that are based on these well-known algorithms as well as the general differences between these two classifier concepts. In addition, two different classifier ensembles are compared (i.e., boosting and RF). The spatial transferability of the approach is evaluated using study sites in West (Bonn) and East Germany (Jena, Goldene Aue), characterized by mean plot sizes of 5–7 ha and 20–30 ha, respectively. Finally, the performance for two different years (2005 and 2007) is compared.
2. Classifier ensembles
Besides various applications in the field of remote sensing and pattern recognition, it has been shown theoretically that the classification accuracy can be increased by combining different independent classifiers (Schapire, 1990; Tumer and Ghosh, 1996). Two strategies exist to generate a classifier ensemble or a multiple classifier system: (1) a combination of different classifier algorithms and (2) a combination of variants of the same algorithm.
Fig. 1. Schematic concept of a decision tree (DT) based classifier ensemble. The different classifiers are generated by (randomly) modifying the training samples or input features.
The presented study is focused on the latter approach: by training the so-called base classifier on modified input data (i.e., training samples or input features), a set of independent classifiers can be generated. Their outputs are then combined into the final result by a voting scheme (Fig. 1). A simple majority vote is often used, which can be more effective than more complex voting strategies. The general concept is based on the assumption that independent classifiers produce individual errors that are not shared by the majority of the other classifiers. Various strategies for generating classifier ensembles have been introduced, for example resampling of the training data (i.e., bagging or boosting) or of the input features (e.g., random feature selection). Banfield et al. (2007) give an introduction to and assessment of various types of classifier ensembles; a brief overview of the main concepts is given below.
Breiman's (1996) bootstrap aggregating (bagging) describes the random generation of training sample subsets, also known as bootstrapped aggregates or bags. The approach is based on the random and uniform selection, with replacement, of n samples from a training set of the same size n; i.e., a training sample can be selected several times in the same sample set, while other samples may not be included in this particular bag. Afterwards an individual decision tree is trained on each of these bags, resulting in various independent classifier outputs. The final classification map is generated by combining the individual outputs.
The concept of boosting was originally introduced by Schapire (1990) as an approach to improve the performance of a weak learning algorithm. At the beginning of the iterative training process, all training samples are equally weighted. Boosting successively changes the weights of the training samples during the training process by comparing the outputs with the known class memberships of the samples.
During the boosting process misclassified samples are assigned a stronger weight than those classified correctly. The training of the next DT within the ensemble is based on the reweighted samples. In doing so, the classifier is forced to concentrate on the misclassified samples that are more difficult to classify, which can reduce the variance and bias of the classification. Unlike bagging, where the individual classifiers can be trained in parallel, boosting generates the different classifiers in an iterative procedure. Consequently, it is computationally rather slow. The AdaBoost.M1 approach (Freund and Schapire, 1996) is widely used in the field of pattern recognition and remote sensing.
Besides resampling of the training data, the modification of the input feature space, e.g. by a random selection of features (i.e., a specific band or image acquisition), is another concept for generating independent classifiers (Ho, 1998; Bryll et al., 2003). It has been shown that this random feature selection approach can be superior to bagging and boosting (Ho, 1998; Bryll et al., 2003).
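As a hedged illustration of the two resampling strategies described above, the following Python sketch draws a bootstrap bag, combines hypothetical tree outputs by majority vote, and applies one reweighting step in the style of AdaBoost.M1 (Freund and Schapire, 1996). The function names, the toy sample count and the fixed seed are assumptions made for illustration; this is not the C5.0 implementation used in the study.

```python
import math
import random
from collections import Counter


def bootstrap_bag(n_samples, rng):
    """Draw n sample indices uniformly, with replacement, from range(n)."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


def majority_vote(labels):
    """Return the label predicted by most ensemble members."""
    return Counter(labels).most_common(1)[0][0]


def adaboost_m1_update(weights, correct):
    """One AdaBoost.M1-style reweighting step.

    weights: current sample weights (summing to 1)
    correct: True where the current tree classified the sample correctly
    Returns the renormalised weights and the tree's voting weight log(1/beta).
    """
    error = sum(w for w, ok in zip(weights, correct) if not ok)
    # Assumption of this sketch: stop if the weighted error is not in (0, 0.5).
    if not 0.0 < error < 0.5:
        raise ValueError("weighted error must lie in (0, 0.5)")
    beta = error / (1.0 - error)
    # Correctly classified samples are down-weighted, so that after
    # renormalisation the misclassified samples carry relatively more weight.
    updated = [w * beta if ok else w for w, ok in zip(weights, correct)]
    total = sum(updated)
    return [w / total for w in updated], math.log(1.0 / beta)


rng = random.Random(0)  # fixed seed, only for reproducibility
n = 10                  # hypothetical number of training samples
bag = bootstrap_bag(n, rng)
not_drawn = sorted(set(range(n)) - set(bag))
print("bag:", sorted(bag), "samples left out:", not_drawn)

# Hypothetical outputs of five trees for one pixel
votes = ["cereals", "grassland", "cereals", "orchards", "cereals"]
print("majority vote:", majority_vote(votes))

# Five equally weighted samples; the last one was misclassified
new_weights, vote_weight = adaboost_m1_update([0.2] * 5, [True] * 4 + [False])
print("new weights:", [round(w, 3) for w in new_weights])
```

In the toy reweighting step the single misclassified sample ends up with half of the total weight, which is what forces the next tree to concentrate on it.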
In contrast to the two aforementioned data partitioning strategies, the training samples remain unchanged by this concept. For each DT a subset of features is created. Unlike bagging, which draws n samples with replacement from a sample set of size n, the method normally selects a subset of the available input features without replacement.
Breiman's (2001) random forests technique uses a set of decision trees $\{DT(x, \theta_m),\ m = 1, \ldots\}$, where $\theta_m$ denotes independent identically distributed random vectors and x an input pattern. Each tree within the ensemble is trained on a subset of the original training samples; in addition, the split rule at each split is determined using only a randomly selected feature subset of the input data (see footnote 2). In comparison to classifier training on the full data set, various trees with different split rules are generated. Consequently several different classification results are obtained. A simple majority vote is used to create the final classification result. The number of selected features within the subset is user defined, and the parameter is usually set to the square root of the number of input features (Gislason et al., 2006). The computational complexity of the individual DT classifier is reduced by limiting the number of features at each split. This enables random forests to handle high-dimensional data sets. In addition, the correlation between the classifiers is decreased, which generally improves the performance of a classifier system. From a computational point of view the method is lighter than the bagging and boosting concepts because it is only based on subsets of the input data (Gislason et al., 2006).
3. Study sites and database
The almost flat study site near Bonn is located in the German state of North Rhine-Westphalia, in the Köln Aachener Bucht. The landscape is dominated by agricultural use, with wheat, barley and sugar beets as main crops. The agricultural plot size varies around 5 ha.
For the classification and validation procedures, detailed field surveys were carried out in both years, 2005 and 2007. The study site Goldene Aue near Jena (Thuringia) is characterized by large plots of 20 to 30 ha. The main land cover classes are wheat, rapeseed and maize. The topography is also flat and hence almost no terrain effects influence the SAR backscatter signal.
The database for the study consists of three time series of C-band SAR data from the ERS-2 SAR and ENVISAT ASAR instruments (Table 1). All data were ordered in precision image format, comprising different swaths and polarizations, and calibrated according to ESA recommendations by Laur et al. (2004). Subsequently, the images were co-registered and orthorectified.
2 The term sample refers to an individual image pixel, whereas the term feature refers to a single image acquisition within the time series.
Table 1. Data sets from the ENVISAT ASAR and ERS-2 SAR instruments utilized in this study.

Bonn 2005                        Bonn 2007                        Jena 2005
Platform  Date    Polarization   Platform  Date    Polarization   Platform  Date    Polarization
ENVISAT   11-Mar  HH             ENVISAT   16-Mar  HH/HV          ENVISAT   22-Apr  HH/HV
ERS-2     17-Mar  VV             ENVISAT   1-Apr   HH/HV          ERS-2     1-May   VV
ENVISAT   12-Apr  HH/HV          ENVISAT   4-Apr   HH/HV          ERS-2     11-May  VV
ENVISAT   15-Apr  HH             ENVISAT   20-Apr  HH/HV          ENVISAT   11-May  HH/HV
ERS-2     15-Apr  VV             ERS-2     26-Apr  VV             ENVISAT   27-May  HH/HV
ERS-2     21-Apr  VV             ENVISAT   6-May   HH/HV          ERS-2     5-Jun   VV
ERS-2     26-May  VV             ENVISAT   9-May   HH/HV          ENVISAT   5-Jun   HH/HV
ERS-2     24-Jun  VV             ENVISAT   25-May  HH/HV          ENVISAT   15-Jun  HH/HV
ERS-2     30-Jun  VV             ERS-2     31-May  VV             ERS-2     10-Jul  VV
ENVISAT   10-Jul  HH/HV          ENVISAT   10-Jun  HH/HV          ENVISAT   10-Jul  HH/HV
ENVISAT   13-Jul  HH/HV          ENVISAT   13-Jun  HH/HV          ERS-2     20-Jul  VV
ENVISAT   22-Jul  HH/HV          ENVISAT   29-Jun  HH/HV          ENVISAT   4-Aug   HH/HV
ERS-2     4-Aug   VV             ERS-2     5-Jul   VV             ERS-2     14-Aug  VV
ENVISAT   14-Aug  HH/HV          ENVISAT   15-Jul  HH/HV          ENVISAT   14-Aug  HH/HV
ENVISAT   8-Sep   VV             ENVISAT   18-Jul  HH/HV          ENVISAT   24-Aug  HH/HV
-         -       -              ERS-2     9-Aug   VV             ERS-2     18-Sep  VV
-         -       -              ERS-2     13-Sep  VV             ENVISAT   18-Sep  HH/HV
For speckle reduction an enhanced Frost filter with a 7 × 7 kernel was applied. For both study sites extensive ground truth campaigns were conducted in the corresponding years.
4. Methods
In the presented study two different classifier ensembles are applied to the data set: boosting and random forests. They are benchmarked against two other classification approaches: a standard Gaussian maximum likelihood classifier (MLC) and a common DT. Duda et al. (2000) give an overview of these classifier concepts. A general introduction to supervised classification in the context of remote sensing is given in Richards and Jia (2006).
The MLC is one of the most commonly used supervised classification techniques. A pixel is assigned to the class whose likelihood is the highest. Although in our case the MLC assumes a Gaussian distribution, which is not ideal in the context of SAR imagery and multitemporal data sets, many studies utilize this approach due to its simplicity and its implementation in almost all standard remote sensing software packages. It is thus regarded as a kind of benchmark classifier for comparison with new approaches.
Decision tree classifiers successively partition the training data into an increasing number of smaller homogeneous classes by producing efficient rules, estimated from the training data. One main element of the classifier is the split rule, which is used at each node of the tree and determines the test and how the data set is split. Although various methods have been introduced, different approaches result in comparable classification accuracies (Mingers, 1989; Pal and Mather, 2003; Zambon et al., 2006). A common rule is the information gain ratio criterion, which is implemented in the C5.0 algorithm used here (Quinlan, 1993). The criterion is based on the measurement of the reduction in the entropy of the data created by each split. In the random forests approach the Gini index (Breiman et al., 1984) triggers the decision.
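The entropy-based gain ratio criterion mentioned above can be sketched as follows. This is a minimal illustration of the textbook form of the criterion (information gain divided by split information), with invented class labels; the function names are hypothetical and the internals of C5.0 are not reproduced here.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of the class labels in a node."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gain_ratio(parent, children):
    """Information gain of a split divided by its split information."""
    n = len(parent)
    non_empty = [child for child in children if child]
    # Entropy remaining after the split, weighted by child size
    remainder = sum(len(child) / n * entropy(child) for child in non_empty)
    gain = entropy(parent) - remainder
    # Split information penalises splits into many small child nodes
    split_info = -sum(len(c) / n * math.log2(len(c) / n) for c in non_empty)
    return gain / split_info if split_info > 0 else 0.0


# Invented toy node with two classes and two candidate binary splits
parent = ["cereals"] * 4 + ["grassland"] * 4
perfect = [["cereals"] * 4, ["grassland"] * 4]
weak = [["cereals"] * 3 + ["grassland"], ["cereals"] + ["grassland"] * 3]

print("perfect split:", gain_ratio(parent, perfect))
print("weak split:   ", round(gain_ratio(parent, weak), 3))
```

The split that separates the two classes completely receives the higher gain ratio and would be preferred at this node.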
The Gini index separates the largest homogeneous group within the training data from the remaining training samples by measuring the impurity at a split node (Zambon et al., 2006). It is defined as:
\[ \mathrm{Gini}(t) = \sum_{i=1}^{n} p_{\omega_i} \left(1 - p_{\omega_i}\right) \]
with $p_{\omega_i}$ as the probability or the relative frequency of class $\omega_i$ at node t, defined as:
\[ p_{\omega_i} = \frac{n_{\omega_i}}{n} \]
with $n_{\omega_i}$ as the number of samples belonging to class $\omega_i$ and n as the total number of samples within a particular node. For each
candidate split the impurity (i.e., Gini index) of the resulting child nodes is summed and compared to the parent node. The split that causes the maximum reduction in impurity is selected (Apte and Weiss, 1997).
The C5.0 software code by Quinlan (1993) was used for the simple decision tree as well as the boosted decision tree classification runs. The random forests classification in this study was performed using a freely available Fortran code (http://www.stat.berkeley.edu/~breiman/RandomForests/). For each random forests classifier, 500 decision trees were generated, using only a reduced number of features at each node. As in other studies (e.g., Gislason et al. (2006)), this value is set approximately to the square root of the number of input images. For example, the full data set Bonn 2005 contains 20 images (15 acquisitions + 5 images with additional polarizations), so the split rule at each node is derived from four randomly selected images (√20 ≈ 4.5). As mentioned above, the results of the individual decision trees were then fused by majority voting.
The following nine classes were included in the legend: Cereals, Coniferous forest, Grassland, Mixed forest, Orchards, Rapeseed, Root crops, Urban, Water. For the test site Goldene Aue a comparable sample set was generated, but instead of Orchards the class Maize was considered. In the Bonn 2007 data set the class Maize was additionally introduced.
Training and validation data sets can be generated with different sampling strategies, e.g. simple random sampling, systematic sampling or stratified random sampling. With simple random sampling each sample has an equal chance of being selected, whereas the systematic approach selects samples at equal intervals over the study area. Stratified random sampling combines a priori knowledge about a study area, such as land cover information, with the simple random sampling approach (Congalton and Green, 1999).
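To make the node-splitting step of random forests described above more concrete, the following Python sketch evaluates the Gini index and searches for the best threshold on four randomly selected images out of 20. It is an illustration only: the backscatter values are invented, the function names are hypothetical, and the child impurities are weighted by child size, which is a common CART-style convention assumed here rather than a detail confirmed by the study.

```python
import math
import random
from collections import Counter


def gini(labels):
    """Gini(t) = sum_i p_i * (1 - p_i) over the classes present in the node."""
    n = len(labels)
    return sum((c / n) * (1 - c / n) for c in Counter(labels).values())


def best_split(samples, labels, n_try, rng):
    """Search thresholds on n_try randomly chosen features.

    Returns (impurity reduction, feature index, threshold) of the best split.
    """
    n = len(labels)
    candidates = rng.sample(range(len(samples[0])), n_try)  # no replacement
    parent = gini(labels)
    best = (0.0, None, None)
    for f in candidates:
        for threshold in sorted({s[f] for s in samples}):
            left = [y for s, y in zip(samples, labels) if s[f] <= threshold]
            right = [y for s, y in zip(samples, labels) if s[f] > threshold]
            if not left or not right:
                continue
            # Assumption: child impurities are weighted by child size
            children = len(left) / n * gini(left) + len(right) / n * gini(right)
            if parent - children > best[0]:
                best = (parent - children, f, threshold)
    return best


n_images = 20                           # Bonn 2005: 15 acquisitions + 5 extra polarizations
n_try = round(math.sqrt(n_images))      # features evaluated per node

# Invented backscatter values (dB): "water" is darker than "urban" in every image
samples = [[-18.0 + 0.1 * i + 0.05 * f for f in range(n_images)] for i in range(4)]
samples += [[-6.0 + 0.1 * i + 0.05 * f for f in range(n_images)] for i in range(4)]
labels = ["water"] * 4 + ["urban"] * 4

rng = random.Random(1)                  # fixed seed, only for reproducibility
reduction, feature, threshold = best_split(samples, labels, n_try, rng)
print("features per node:", n_try)
print("impurity reduction:", reduction, "on image", feature)
```

Because every tree node sees a different random subset of images, trees in the ensemble end up with different split rules even when trained on similar samples.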
When land cover classes are used as a priori knowledge, stratified random sampling guarantees that all classes are included in the sample set. In our case, a stratified random sampling approach was chosen for the generation of independent training and validation sets, using the corresponding ground truth data as a priori knowledge. To investigate the impact of the number of training samples on the classification accuracy, training sets of different sizes were generated, containing 15, 30, and 50 samples per land cover class (from now on referred to as #15, #30 and #50). Consequently the training set #50 contains 450 training pixels (50 samples for each of the nine classes), whereas the training set #15 contains only 135 samples. Using stratified random sampling as before, an independent validation set was generated for all test sites and years, consisting of 50 samples per land cover class.
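The stratified random sampling of training and validation sets can be sketched as follows. The ground truth below is synthetic (200 labelled pixels per class), the function name is hypothetical, and excluding validation pixels before drawing the training sets is only one possible way to obtain independent sets; how independence was ensured in the study is not specified in this detail.

```python
import random
from collections import defaultdict

CLASSES = ["Cereals", "Coniferous forest", "Grassland", "Mixed forest",
           "Orchards", "Rapeseed", "Root crops", "Urban", "Water"]


def stratified_sample(ground_truth, n_per_class, rng, exclude=frozenset()):
    """Draw n_per_class pixel ids at random from every class (stratum)."""
    by_class = defaultdict(list)
    for pixel_id, label in ground_truth:
        if pixel_id not in exclude:
            by_class[label].append(pixel_id)
    sample = []
    for label in sorted(by_class):
        # Simple random sampling without replacement inside each stratum
        sample.extend(rng.sample(by_class[label], n_per_class))
    return sample


rng = random.Random(42)  # fixed seed, only for reproducibility

# Synthetic ground truth: (pixel id, class label), 200 pixels per class
ground_truth = [(i, CLASSES[i % len(CLASSES)]) for i in range(len(CLASSES) * 200)]

validation = stratified_sample(ground_truth, 50, rng)
held_out = set(validation)
train_50 = stratified_sample(ground_truth, 50, rng, exclude=held_out)
train_15 = stratified_sample(ground_truth, 15, rng, exclude=held_out)

print(len(train_50), len(train_15), len(validation))
```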
Table 2. Overall classification accuracy [%] of the four different algorithms for the test site Bonn in 2005 (training sample set #50).

Classifier        Bonn 2005
DT                55.8
MLC               58.2
Boosted DT        67.8
Random Forests    75.3
5. Results and discussion
Four different classifier algorithms (MLC, DT, a boosted DT and RF) were applied to a data set from the test site Bonn (Table 2). The experimental results clearly show the positive effect of classifier ensembles: comparing the four approaches, it can be observed that the common DT performs worst in terms of accuracy, followed by the MLC. The two DT-based classifier systems significantly outperform the two conventional classifiers. The RF concept increases the accuracy by 19.5 percentage points with regard to a single DT and by about 17 percentage points with regard to a standard MLC. Furthermore, the good performance of RF is underlined by a difference of 7.5 percentage points between a boosted DT and the RF.
The visual assessment of the classification results confirms the generally good performance of classifier ensembles (Fig. 2). The maps from the simple decision tree and MLC show the general structures of the classified area, but appear very noisy even in homogeneous areas. Sometimes this salt-and-pepper effect reaches a level where the correct land cover type can no longer be assigned to the entire plot (Fig. 2(a),(b)). Many borders between individual agricultural parcels appear blurred and are hard to identify. This drawback is significantly reduced by the boosted DT and the RF (Fig. 2(c),(d)). Almost all areas can be assigned properly to a specific class. Edges between different natural objects can be identified more clearly. Nevertheless, some noise still exists. Visually, some confusion seems to remain between classes such as urban and forest as well as coniferous and mixed forest. The comparison of the maps produced by the boosted DT and the RF underlines the statistical assessment: RF results in a more homogeneous map than boosting.
The main reason for the success is certainly the underlying assumption of classifier ensembles. The concept is based on the generation of independent classifiers, and the performance is directly influenced by this independence. Given the relative independence of images from different acquisition times, a concept based on a random selection of the input features seems well suited for multitemporal approaches. The independence of the individual classifiers, and thus the performance of the ensemble, is further increased by combining this feature selection with a random selection of training samples, as done in random forests. A second reason for the higher accuracy of RF is that the performance of boosting can be reduced by the inherent noise of SAR data. During the boosting process the algorithm starts to concentrate on samples that are difficult to classify. Consequently it also focuses on noisy outliers, and the DT is overfitted to these samples (Bauer and Kohavi, 1999).
The good performance of the random forests is additionally underlined by the class-specific accuracies (Table 3). Apart from a few exceptions, the RF clearly achieves higher producer and user accuracies than the other methods. Moreover, the results are more stable, with less variance in the accuracies. Considering the level of detail (e.g. different crop types) and the complexity of the classes (e.g. orchards, urban), the accuracies achieved by the RF and confirmed by a rigorous independent validation (almost everywhere >60%) are very promising. The quality of all classifiers is lowest for grassland sites, followed by the coniferous forest class. The detailed accuracy matrix for the RF in Table 4 reveals that grassland is confused particularly with orchards, rapeseed and coniferous forest, while coniferous forest is in turn confused with grassland and orchards. Orchards and coniferous forest evidently both appear as rough surfaces in C-band SAR.
However, in this context it is striking that mixed forest is captured rather well. The producer accuracy of water in the MLC is also surprisingly low (Table 3). The RF likewise misassigns many water pixels to mixed forest (Table 4), while the user accuracy is excellent. The structurally most heterogeneous class "urban" also shows this in the detailed producer accuracies, with validation pixels erroneously assigned to almost every other class. However, also in this case the user accuracy is more than reasonable at 94.1%. All
Table 3. Producer and user accuracies [%] for all four classifiers applied to the data set Bonn 2005 (training sample set #50).

                    DT                  MLC                 Boosted DT          RF
Class               Producer  User      Producer  User      Producer  User      Producer  User
Cereals             64        61.5      72        75        74        69.8      88        77.2
Coniferous forest   46        52.3      34        63        44        62.9      60        75
Grassland           44        45.8      50        51        58        58        64        58.2
Mixed forest        62        46.3      70        49.3      70        64.8      80        70.2
Orchards            54        29.3      54        42.9      68        49.3      78        56.5
Rapeseed            56        77.8      70        54.7      70        70        78        81.3
Rootcrops           80        83.3      78        67.2      94        73.4      90        86.5
Urban               44        59.5      76        64.4      64        78        64        94.1
Water               52        100.0     20        90.9      68        100       76        100
Table 4. Complete confusion matrix for the random forest classifier for the data set Bonn 2005 (training sample set #50). Rows: classified land cover class; columns: ground truth (50 samples per class).

Land cover class    Cereals   Conif.    Grassl.   Mixed f.  Orchards  Rapeseed  Rootcr.   Urban     Water
Cereals             44        0         0         5         4         1         0         2         1
Coniferous forest   0         30        6         0         1         1         0         2         0
Grassland           0         9         32        1         1         5         0         6         1
Mixed forest        4         0         0         40        2         1         0         1         9
Orchards            1         9         6         3         39        1         5         4         1
Rapeseed            0         2         4         1         0         39        0         2         0
Rootcrops           1         0         0         0         3         2         45        1         0
Urban               0         0         2         0         0         0         0         32        0
Water               0         0         0         0         0         0         0         0         38
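The relation between the confusion matrix in Table 4 and the producer, user and overall accuracies reported for the RF can be checked with a short Python sketch. The matrix values are transcribed from Table 4, with rows as classified classes and columns as ground truth classes; the function and variable names are illustrative assumptions.

```python
CLASSES = ["Cereals", "Coniferous forest", "Grassland", "Mixed forest",
           "Orchards", "Rapeseed", "Rootcrops", "Urban", "Water"]

# Rows: classified class; columns: ground truth class (values from Table 4)
MATRIX = [
    [44, 0, 0, 5, 4, 1, 0, 2, 1],
    [0, 30, 6, 0, 1, 1, 0, 2, 0],
    [0, 9, 32, 1, 1, 5, 0, 6, 1],
    [4, 0, 0, 40, 2, 1, 0, 1, 9],
    [1, 9, 6, 3, 39, 1, 5, 4, 1],
    [0, 2, 4, 1, 0, 39, 0, 2, 0],
    [1, 0, 0, 0, 3, 2, 45, 1, 0],
    [0, 0, 2, 0, 0, 0, 0, 32, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 38],
]


def accuracies(matrix):
    """Return producer accuracies, user accuracies and overall accuracy in %."""
    n = len(matrix)
    diagonal = [matrix[i][i] for i in range(n)]
    row_sums = [sum(row) for row in matrix]                       # pixels per classified class
    col_sums = [sum(matrix[i][j] for i in range(n)) for j in range(n)]  # pixels per ground truth class
    producer = [100.0 * d / c for d, c in zip(diagonal, col_sums)]
    user = [100.0 * d / r for d, r in zip(diagonal, row_sums)]
    overall = 100.0 * sum(diagonal) / sum(row_sums)
    return producer, user, overall


producer, user, overall = accuracies(MATRIX)
for name, p, u in zip(CLASSES, producer, user):
    print(f"{name:<18} producer {p:5.1f}  user {u:5.1f}")
print(f"overall accuracy {overall:.1f}")
```

Producer accuracy divides the diagonal by the ground truth column total (50 per class here), whereas user accuracy divides it by the number of pixels assigned to the class, which is why a class such as urban can combine a moderate producer accuracy with a high user accuracy.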
Fig. 2. Classification results of the Bonn 2005 data set (training set #50) and the corresponding ground truth (e), using a single DT (a), MLC (b), boosted DT (c), and RF (d).
other classes show no major outliers. From an application point of view this is most interesting for the agricultural classes, as their areal quantification is important for yield forecasts or (site-specific) management actions.

In addition to the comparison of different algorithms, random forests were trained with three different sizes of the training sample set, comprising 15, 30, and 50 samples per class. The results (Table 5) demonstrate the positive effect of increasing the number of training samples on the one hand, and the stability and good performance of random forests on the other. The overall accuracy improves by 5.6 percentage points when the sample size is doubled from 15 to 30, and only slightly, by 0.8 percentage points, when the number of training samples is increased further to 50. Nevertheless, the results demonstrate clearly that RF are relatively stable with regard to sample size and that even small training sets are adequate to achieve high-quality output.

Table 5 Overall accuracies [%] using random forests with different numbers of training samples.

| Training sample set | Bonn 2005 |
|---|---|
| #15 | 68.9 |
| #30 | 74.5 |
| #50 | 75.3 |
Fig. 3. Classification results using random forests for the Bonn 2007 data set (a) and the study site Goldene Aue near Jena in 2005 (b).

Table 6 Overall accuracies [%] for the method transfer to the data sets of Bonn 2007 and Jena 2005.

| Classifier | Bonn 2007 | Jena 2005 |
|---|---|---|
| DT | 60.6 | 71.4 |
| MLC | 59.4 | 63.3 |
| Boosted DT | 74.0 | 78.7 |
| Random forests | 79.8 | 83.8 |
Temporal and spatial transferability

For operational applications, it is important that an approach delivers comparable results on more than a single data set. Hence, to validate the transferability, the classifiers are run on a second data set from another year at the same location (Bonn 2007) and on a similar time series from a second test site (Jena 2005). Each classification is performed using the corresponding ground truth and following the method described in Section 4. Table 6 gives the overall accuracies for these two data sets. Whereas the MLC performs better than a single DT on the Bonn 2005 data, the DT classifier now achieves higher accuracies than the MLC. However, the very good performance of the classifier ensembles is reproduced: a boosted DT, for example, improves the results by about 15% compared to a conventional MLC. As before, the RF approach achieves the highest accuracies, exceeding the boosted DT by 5.8% on the Bonn 2007 data set and by 5.1% on the Jena site. The higher accuracies for the Jena site can be attributed to the considerably larger plots and the resulting more homogeneous structures in this data set. The quality for the Bonn test site is comparable for both years, with slightly better values for 2007. The reason might be that the 2007 data set contains more imagery from May, which is generally an important stage in the plant development of cereals and hence might support the separation of land cover classes. Some confusion remains, mixing urban and forest as well as the two forest classes, owing to the similar temporal behaviour of their radar clutter.
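The differences quoted in this comparison can be recomputed from the overall accuracies in Table 6. The snippet below is a minimal sketch of that arithmetic; the dictionary layout and the helper name `gain` are illustrative choices and not part of the study.

```python
# Overall accuracies [%] transcribed from Table 6.
OVERALL_ACCURACY = {
    "Bonn 2007": {"DT": 60.6, "MLC": 59.4, "boosted DT": 74.0, "RF": 79.8},
    "Jena 2005": {"DT": 71.4, "MLC": 63.3, "boosted DT": 78.7, "RF": 83.8},
}


def gain(site, better, baseline):
    """Difference in overall accuracy (percentage points), rounded to 0.1."""
    scores = OVERALL_ACCURACY[site]
    return round(scores[better] - scores[baseline], 1)


for site in OVERALL_ACCURACY:
    print(site,
          "| RF vs boosted DT:", gain(site, "RF", "boosted DT"),
          "| boosted DT vs MLC:", gain(site, "boosted DT", "MLC"),
          "| DT vs MLC:", gain(site, "DT", "MLC"))
```

The gain of the boosted DT over the MLC comes out at 14.6 points for Bonn 2007 and 15.4 points for Jena 2005, which is the basis of the rounded figure of 15% given in the text.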
The visual inspection (Fig. 3) again shows that classifier ensembles considerably reduce the salt-and-pepper effect inherent to most pixel-based land cover classifications of SAR data.

6. Conclusions and outlook

Overall, random forests appear very well suited for classifying multitemporal SAR data. The results show clearly that, on multitemporal SAR data, classifier ensembles are superior to standard classification approaches such as a single decision tree or a maximum likelihood classifier. A higher accuracy is reached regardless of the method used to generate the classifier ensemble. Comparing the results achieved by the different classifier ensembles, it can be concluded that random forests outperform the boosting technique. The good results for separating agricultural classes, which are hard to describe by mono-temporal analyses, underline the value of multitemporal analyses in this context. The approach performs well even with a small number of training samples. Moreover, it can handle larger time series without any prior feature selection. Another advantage of RF is computational efficiency, since boosting algorithms require iterative procedures. In view of this, and because it is also rather simple to implement, the approach seems well suited for operational monitoring. The study reveals that urban and forest areas in particular are difficult to separate on the basis of the SAR backscatter time series alone. Here the integration of at least one optical scene shows advantages (Waske and Benediktsson, 2007; Waske and van der Linden, 2008) and should be considered if no multi-frequency data are available. Additional texture features might be helpful if only C-band data are available (Haralick et al., 1973; Soares et al., 1997). However, the concept of classifier ensembles will remain applicable and beneficial for such large multi-source data stacks.
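The role of independence among ensemble members, named in the discussion as the main reason for the success of classifier ensembles, can be illustrated with a simple probability calculation. The sketch below is an idealised model and not the procedure used in this study: it assumes a two-class problem, an odd number of classifiers, and members that are each correct with the same probability and err independently of one another. The single-classifier accuracy of 0.6 is a hypothetical value, and the function names are illustrative.

```python
from math import comb


def majority_vote_accuracy(n_classifiers, p_single):
    """Probability that a strict majority of n independent voters is correct.

    Assumes a two-class problem and that every voter is correct with
    probability p_single, independently of all other voters.
    """
    needed = n_classifiers // 2 + 1
    return sum(
        comb(n_classifiers, k)
        * p_single ** k
        * (1 - p_single) ** (n_classifiers - k)
        for k in range(needed, n_classifiers + 1)
    )


def identical_copies_accuracy(n_classifiers, p_single):
    """Fully dependent voters always agree, so voting cannot improve on one member."""
    return p_single


# Hypothetical single-classifier accuracy of 0.6, odd ensemble sizes only.
for n in (1, 5, 25, 101):
    print(n,
          round(majority_vote_accuracy(n, 0.6), 3),
          identical_copies_accuracy(n, 0.6))
```

Under these assumptions the majority vote improves as members are added, whereas an ensemble of identical copies gains nothing. The trees of a real random forest lie between these two extremes, which is why random selection of both features and training samples is used to reduce the correlation between them.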
The need for efficient classifiers for SAR imagery becomes even more important in view of the current and upcoming availability of multi-frequency SAR data in various polarizations and with increasing temporal coverage. It can be expected that the quality of the classification can be considerably enhanced by
choosing an appropriate classifier for such multi-channel imagery even without running time-consuming segmentation procedures.

Acknowledgements

The authors would like to thank Tanja Riedel and Christiane Schmullius (Jena University) for supporting this study with satellite and ground truth data from their test site Goldene Aue. Funding for this study was provided by the German Aerospace Centre (DLR) and the German Ministry of Economy (BMWi) within the ENVILAND project under contract FKZ 50EE0404. Data were made available by ESA under CAT 1 (C1P 3115).

References

Apte, C., Weiss, S., 1997. Data mining with decision trees and decision rules. Future Generation Computer Systems 13 (2–3), 197–210.
Banfield, R.E., Hall, L.O., Bowyer, K.W., Kegelmeyer, W.P., 2007. A comparison of decision tree ensemble creation techniques. IEEE Transactions on Pattern Analysis and Machine Intelligence 29 (1), 173–180.
Bauer, E., Kohavi, R., 1999. An empirical comparison of voting classification algorithms: Bagging, boosting, and variants. Machine Learning 36 (1), 105–139.
Benediktsson, J.A., Swain, P.H., Ersoy, O.K., 1990. Neural network approaches versus statistical-methods in classification of multisource remote-sensing data. IEEE Transactions on Geoscience and Remote Sensing 28 (4), 540–552.
Blaes, X., Vanhalle, L., Defourny, P., 2005. Efficiency of crop identification based on optical and SAR image time series. Remote Sensing of Environment 96 (3–4), 352–365.
Breiman, L., Friedman, J., Olshen, R., Stone, C., 1984. Classification and Regression Trees. Chapman and Hall, New York.
Breiman, L., 1996. Bagging predictors. Machine Learning 24 (2), 123–140.
Breiman, L., 2001. Random forests. Machine Learning 45 (1), 5–32.
Briem, G.J., Benediktsson, J.A., Sveinsson, J.R., 2002. Multiple classifiers applied to multisource remote sensing data. IEEE Transactions on Geoscience and Remote Sensing 40 (10), 2291–2299.
Brisco, B., Brown, R.J., 1995. Multidate SAR/TM synergism for crop classification in Western Canada. Photogrammetric Engineering and Remote Sensing 61 (8), 1009–1014.
Brown de Colstoun, E.C., Story, M.H., Thompson, C., Commisso, K., Smith, T.G., Irons, J.R., 2003. National park vegetation mapping using multitemporal Landsat 7 data and a decision tree classifier. Remote Sensing of Environment 85 (3), 316–327.
Bruzzone, L., Marconcini, M., Wegmuller, U., Wiesmann, A., 2004. An advanced system for the automatic classification of multitemporal SAR images. IEEE Transactions on Geoscience and Remote Sensing 42 (6), 1321–1334.
Bryll, R., Gutierrez-Osuna, R., Quek, F., 2003. Attribute bagging: Improving accuracy of classifier ensembles by using random feature subsets. Pattern Recognition 36 (6), 1291–1302.
Carreiras, J.M.B., Pereira, J.M.C., Campagnolo, M.L., Shimabukuro, Y.E., 2006. Assessing the extent of agriculture/pasture and secondary succession forest in the Brazilian legal Amazon using SPOT VEGETATION data. Remote Sensing of Environment 101 (3), 283–298.
Chakraborty, M., Panigrahy, S., Sharma, S.A., 1997. Discrimination of rice crop grown under different cultural practices using temporal ERS-1 synthetic aperture radar data. ISPRS Journal of Photogrammetry and Remote Sensing 52 (4), 183–191.
Chust, G., Ducrot, D., Pretus, J.L.L., 2004. Land cover discrimination potential of radar multitemporal series and optical multispectral images in a Mediterranean cultural landscape. International Journal of Remote Sensing 25 (17), 3513–3528.
Congalton, R.G., Green, K., 1999. Assessing the Accuracy of Remote Sensed Data: Principles and Practices, 1st ed. CRC Press.
Duda, R.O., Hart, P.E., Stork, D.G., 2000. Pattern Classification, 2nd ed. John Wiley & Sons Inc., Chichester, New York.
Foody, G.M., Arora, M.K., 1997. An evaluation of some factors affecting the accuracy of classification by an artificial neural network. International Journal of Remote Sensing 18 (4), 799–810.
Freund, Y., Schapire, R.E., 1996. Experiments with a new boosting algorithm. In: Proceedings 13th Intern. Conference on Machine Learning. pp. 148–156.
Friedl, M.A., Brodley, C.E., 1997. Decision tree classification of land cover from remotely sensed data. Remote Sensing of Environment 61 (3), 399–409.
Gislason, P.O., Benediktsson, J.A., Sveinsson, J.R., 2006. Random forests for land cover classification. Pattern Recognition Letters 27 (4), 294–300.
Ham, J., Chen, Y.C., Crawford, M.M., Ghosh, J., 2005. Investigation of the random forest framework for classification of hyperspectral data. IEEE Transactions on Geoscience and Remote Sensing 43 (3), 492–501.
Haralick, R.M., Shanmugam, K., Dinstein, I., 1973. Textural features for image classification. IEEE Transactions on Systems, Man, Cybernetics 3 (6), 610–621.
Ho, T.K., 1998. The random subspace method for constructing decision forests. IEEE Transactions on Pattern Analysis and Machine Intelligence 20 (8), 832–844.
Laur, H., Bally, P., Meadows, P., Sanchez, J., Schaettler, B., Lopinto, E., Esteban, D., 2004. Derivation of the backscattering coefficient σ⁰ in ESA ERS SAR PRI products. ESA Document ES-TN-RS-PM-HL09, Issue 2, Rev. 5f.
Lawrence, R.L., Wood, S.D., Sheley, R.L., 2006. Mapping invasive plants using hyperspectral imagery and Breiman Cutler classifications (RandomForest). Remote Sensing of Environment 100 (3), 356–362.
Mingers, J., 1989. An empirical comparison of selection measures for decision-tree induction. Machine Learning 3 (4), 319–342.
Pal, M., Mather, P.M., 2003. An assessment of the effectiveness of decision tree methods for land cover classification. Remote Sensing of Environment 86 (4), 554–565.
Pal, M., 2005. Random forest classifier for remote sensing classification. International Journal of Remote Sensing 26 (1), 217–222.
Panigrahy, S., Manjunath, K.R., Chakraborty, M., Kundu, N., Parihar, J.S., 1999. Evaluation of RADARSAT standard beam data for identification of potato and rice crops in India. ISPRS Journal of Photogrammetry and Remote Sensing 54 (4), 254–262.
Quinlan, J.R., 1993. C4.5: Programs For Machine Learning. Morgan Kaufmann, Los Altos.
Richards, J.A., Jia, X., 2006. Remote Sensing Digital Image Analysis: An Introduction, 4th ed. Springer, New York.
Schapire, R.E., 1990. The strength of weak learnability. Machine Learning 5 (2), 197–227.
Simard, M., Saatchi, S.S., De Grandi, G., 2000. The use of decision tree and multiscale texture for classification of JERS-1 SAR data over tropical forest. IEEE Transactions on Geoscience and Remote Sensing 38 (5), 2310–2321.
Soares, J.J., Rennó, C.D., Formaggio, A.R., da Costa Freitas, Y.C., Frery, A.C., 1997. An investigation of the selection of texture features for crop discrimination using SAR imagery. Remote Sensing of Environment 59 (2), 234–247.
Stankiewicz, K.A., 2006. The efficiency of crop recognition on ENVISAT ASAR images in two growing seasons. IEEE Transactions on Geoscience and Remote Sensing 44 (4), 806–814.
Tso, B., Mather, P.M., 1999. Crop discrimination using multi-temporal SAR imagery. International Journal of Remote Sensing 20 (12), 2443–2460.
Tumer, K., Ghosh, J., 1996. Analysis of decision boundaries in linearly combined neural classifiers. Pattern Recognition 29 (2), 341–348.
Waske, B., Schiefer, S., Braun, M., 2006. Random feature selection for decision tree classification of multi-temporal SAR data. In: Proceedings IEEE International Geoscience and Remote Sensing Symposium, 2006, IGARSS 2006, July 31 2006–August 4, pp. 168–171. doi:10.1109/IGARSS.2006.48.
Waske, B., Benediktsson, J.A., 2007. Fusion of support vector machines for classification of multisensor data. IEEE Transactions on Geoscience and Remote Sensing 45 (12), 3853–3854.
Waske, B., van der Linden, S., 2008. Classifying multilevel imagery from SAR and optical sensors by decision fusion. IEEE Transactions on Geoscience and Remote Sensing 46 (5), 1457–1466.
Zambon, M., Lawrence, R., Bunn, A., Powell, S., 2006. Effect of alternative splitting rules on image processing using classification tree analysis. Photogrammetric Engineering and Remote Sensing 72 (1), 25–30.