Case-based predictions for species and habitat mapping


Ecological Modelling 177 (2004) 259–281

Case-based predictions for species and habitat mapping Kalle Remm∗ Institute of Geography, University of Tartu, 46 Vanemuise Street, Tartu 51014, Estonia Received 25 June 2003; received in revised form 23 February 2004; accepted 4 March 2004

Abstract

The specific feature of case-based predictions is the presence of empirical data within the predicting system. Case-based methods are especially well suited to situations where a single class is represented by more than one cluster, or where a class fills an irregular shape in the feature space. Case-based ecological mapping relies on the assumption that a species (or other phenomenon) will be found in locations similar to those where it has already been registered. Lazy learning is a machine learning approach to fitting case-based prediction systems that prefers raw data to generalisations. An overview of lazy learning methods is given: feature selection and weighting, fitting the number of exemplars or the kernel extent used for decisions, indexing the case-base, learning new and forgetting useless exemplars, and exemplar weighting. A case study of habitat and forest composition mapping was carried out in the Otepää Upland in South-East Estonia. Habitat classes were predicted and mapped over the whole study area; characteristics of stand composition (presence/absence of Quercus robur, total coverage of the forest stand, coverage of coniferous trees, and the eight main tree species separately) were mapped on non-agricultural and non-settlement areas. The explanatory variables were derived from a Landsat 7 ETM image, greyscale and colour orthophotos, an elevation model, the 1:10,000 digital base map, and a soil map. One thousand random locations were described in the field in order to obtain training data. Four methods of machine learning were compared. In the calculation of similarity, exemplar and feature weights regulated both the influence of particular exemplars and features and the kernel extent. Goodness-of-fit of the predictions was estimated using leave-one-out cross-validation. A machine learning method combining stepwise feature selection, feature weighting, and exemplar weighting gave the best results for 10 of the response variables. A method involving iterative random sampling proved best for the other seven variables. The best fit was found for the variables: habitat class (κ = 0.85), oak presence/absence (mean true positive + mean true negative − 1 = 0.72), coverage of coniferous trees (R2 = 0.80), coverage of Pinus sylvestris (R2 = 0.72), and coverage of Picea abies (R2 = 0.73). In most cases, less than half of the training instances were retained as exemplars after case filtering, and less than half of the explanatory variables were used in the predictive sets. All 31 explanatory variables were included in a predictive set of features at least once. The most valuable predictor was the land cover category according to the 1:10,000 base map.

© 2004 Elsevier B.V. All rights reserved.

Keywords: Case-based reasoning; Lazy learning; Image interpretation; Species and habitat mapping

∗ Tel.: +372-7375827; fax: +372-7375825. E-mail address: [email protected] (K. Remm).

0304-3800/$ – see front matter © 2004 Elsevier B.V. All rights reserved. doi:10.1016/j.ecolmodel.2004.03.004

1. Introduction

Computer-aided classification of remote sensing data, aerial photos, and soil and elevation data has usually been used for landscape or vegetation mapping and species distribution modelling (Neave and Norton, 1998; Vogelmann et al., 1998; Münier et al., 2001). The most common method of satellite image interpretation is supervised classification using maximum likelihood estimates. Maximum likelihood classification proceeds by selecting the largest posterior probability. The calculation of a histogram of pixel values for every class demands a large amount of training data. In practice, it is common to replace the probability density function of reflectance values in wavebands with parameters (mean, variance, and covariance), assuming a normal distribution of the reflectance values (Atkinson and Lewis, 2000). Difficulties in separating different forest types have been reported when using maximum likelihood classification (Aaviksoo et al., 2000). One possible way of improving classification reliability is to combine calculated classification with visual interpretation of satellite images, as was done in the preparation of the Estonian 1:50,000 base map (Sagris and Krusberg, 1997) and the Co-ordination of Information on the Environment (CORINE) land cover map (Meiner, 1999). Another way of overcoming the limitations of the presumption of normality is to look for other methods of automated data processing. All statistical prediction methods rely on certain presumptions; e.g. discriminant analysis presumes that variables are normally distributed and uncorrelated. Most artificial intelligence (AI) methods, on the other hand, are not based on an assumption of normality within the training data. AI methods are well suited to situations where a class forms an irregular shape in the feature space or a single class is represented by more than one cluster. Nonparametric statistical methods, like generalized additive models, can be as flexible as AI methods, but still demand a formal model. Case-based reasoning is an extremely empirical AI approach that draws conclusions from observation data as directly as possible, without the formulation of a model as an intermediary between the data and the prediction.
The primary objective of this article is to introduce case-based reasoning and a machine learning technique—lazy learning—for habitat mapping and forest composition estimation, and to demonstrate these using a dataset of images, digital maps and field observations. The paper is organised as follows. The first part gives a short overview of species and habitat distribution modelling methods; a brief overview of lazy learning and case-based reasoning follows; then a case study is described and the results are presented. Four machine learning methods and the outcome of statistical modelling are compared in the results.

2. Prediction methods of species and habitat distribution

Studies of species and habitat relations and distribution mainly use methods of statistical description and modelling (Franklin, 1995; Guisan and Zimmermann, 2000). The common methods have been: indices of habitat preference and habitat suitability (Duncan, 1983; Breininger et al., 1991; Hansen et al., 2001); habitat suitability estimation based on the difference from the median of species occurrence on the axes of ecological niche factors (Hirzel et al., 2002; Reutter et al., 2003); overlay operations with presence/absence data (Jensen et al., 1992; Brito et al., 1999); canonical correspondence analysis (Hill, 1991); the combination of probability distributions (Bayesian methods) (Milne et al., 1989; Aspinall, 1992); logistic regression (Austin et al., 1990; Buckland and Elston, 1993; Puttock et al., 1996; Brito et al., 1999; Saveraid et al., 2001); and generalised additive models (Franklin, 1998; Frescino et al., 2001). Decision-tree-based techniques (classification and regression trees) and k-nearest-neighbours (k-NN) methods can be seen either as statistical methods or as AI methods, depending on the role machine learning plays in finding the best fit. Decision trees have been widely used for species and habitat distribution mapping, especially in recent years (Moore et al., 1991; Stankovski et al., 1998; Debeljak et al., 1999, 2001; De’ath and Fabricius, 2000; Kobler and Adamic, 2000; Hansen et al., 2001; De’ath, 2002). Methods based on the use of similar cases have been applied less frequently than decision trees and are probably less widely known among ecologists. Decision tree models are the only machine learning methods presented in the overview article on predictive vegetation mapping by Franklin (1995), and Guisan and Zimmermann (2000) do not mention machine learning or case-based solutions at all in their overview of predictive habitat distribution models.
The most similar examples have been used for modelling the habitat preference of brown bears (Clark et al., 1993), and for vegetation mapping and prediction (Osborne and Brearley, 2000; Wilds et al., 2000; Remm, 2002). The k-NN method has been applied to forest inventory using remote sensing, field, and map data in Finland since 1990 (Muinonen et al., 2001; Tomppo et al., 2002). Similar analogues have also been used in palaeoecological reconstructions (Birks, 1993; Flower et al., 1997). The advantage of k-NN is good performance in predicting multimodal variables, especially if the sample size is large. k-NN has been found to outperform discriminant analysis when the normality, linearity, and identical covariance assumptions are violated (Kiang, 2003). k-NN was also superior to the nearest axis-parallel hyper-rectangle algorithm (Wettschereck and Dietterich, 1995). CBR systems have mainly been developed for decision support, solving classification tasks in weak-theory domains. Empirical observations concerning relationships between the distribution of organisms and their ecological demands dominate over theory in species and habitat distribution mapping as well. Statistical models tend to become complicated if numerous nominal and continuous explanatory variables are involved and the dependent variable is multidimensional, like, e.g. a plant community. A case-based system that involves only one explanatory and one dependent variable does not differ much from one with hundreds of explanatory and dependent variables of different types and explanatory value. Therefore, in view of the role of empirical data and the complexity of relationships in ecological mapping, a case-based approach could have wider application than hitherto.

3. Case-based predictions

3.1. Learning

The characteristic feature of intelligence is the ability to learn from experience. Learning is defined as constructing or modifying representations of experiences (Michalski, 1986). Learning and inference can be inductive, i.e. inferred from single observations, or deductive, i.e. starting from certain premises and concluding with the logical consequences of these assertions. Inductive machine learning relies on the theory of dynamic memory (DM) (Schank and Abelson, 1977; Schank, 1982). According to DM theory, decisions in a new situation are made by comparing the new conditions to similar past experience. According to R.C. Schank, the organisation of an effective memory has to be dynamic, meaning that learning is mainly the reorganisation of memory structures. Although DM theory is a simplification and has several weaknesses in interpreting human behaviour (Pazzani, 1991; Ramirez and Cooley, 1997), learning from experience has proved to be effective in machine learning and is common in human learning; e.g. professional tennis players train by repeating their actions thousands of times, rather than by studying the physics of spinning tennis balls (Aha and Salzberg, 1993). In AI studies, learning is called machine learning and the knowledge is stored in knowledge bases. A knowledge base can contain descriptions (feature vectors) of raw examples of previous experience, and generalised knowledge formalised into rules and prototypes. Analogously, problem-solving strategies can be either search-intensive (also called memory-intensive) methods, i.e. case-based reasoning (CBR), or knowledge-intensive methods, also called model-based or rule-based reasoning (RBR). CBR and RBR are not specific technological solutions, but rather principles for solving problems. Rule-based technologies store their knowledge in a rule-base; these methods include, e.g. decision trees, statistical models, and artificial neural networks (ANN). The knowledge of a CBR system is stored in a case-base. In CBR, generalisations (concepts) are not represented as features abstracted over exemplars, but rather as typical exemplars without summary information. Supervised classification models and AI systems have to be trained to fit empirical training data. For fitting predictive systems, case-based techniques mainly use lazy learning (also called memory-based learning) algorithms, which keep knowledge in a relatively raw form until a request for information arrives. They merely combine information during problem solving and do not try to generalise the knowledge into models; the intermediate results are discarded (Mitchell, 1997; Wettschereck et al., 1997; Aha, 1998a,b). Lazy learning and case-based systems are open to new knowledge, as new cases can be continuously added to the case-base.
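The openness of a lazy learner to new knowledge can be illustrated with a minimal sketch (illustrative Python with hypothetical names, not the software used in this study): the case-base stores raw feature vectors, new cases are added at any time without refitting anything, and all computation is deferred to prediction time.

```python
# A minimal lazy (memory-based) learner: 1-nearest-neighbour decision
# rule over raw stored cases. All names here are illustrative.

class LazyCaseBase:
    def __init__(self):
        self.cases = []          # raw feature vectors, no generalisation

    def add_case(self, features, outcome):
        # Learning is just storage: no model is refitted.
        self.cases.append((features, outcome))

    def predict(self, query):
        # All computation is deferred to prediction time.
        def distance(case):
            f, _ = case
            return sum((a - b) ** 2 for a, b in zip(f, query))
        _, outcome = min(self.cases, key=distance)
        return outcome

cb = LazyCaseBase()
cb.add_case([0.1, 0.2], "bog")
cb.add_case([0.8, 0.9], "forest")
print(cb.predict([0.7, 0.8]))   # nearest stored case decides: forest
```

Adding a third case between the two calls would immediately influence the next prediction, which is exactly the "no need to change a model" property discussed below.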
There is no need to change a model if additional training data become available. Eager learning is the generalising and formalising of knowledge during the first stages of knowledge acquisition, after which the raw observations and cases are discarded. The traditional use of signatures, as statistical generalisations of predictable classes, in image processing belongs to eager learning in the machine learning sense. The effectiveness of eager learning is limited by the initial assumptions of the models and by the initial data set used for model building. On the other hand, generalisations are better understood, and the reliability of successful generalisations outside the training data is usually better than that of detailed empirical models.

3.2. Case-based reasoning

Case-based reasoning (CBR) is defined as a multi-disciplinary research area and problem-solving strategy that reuses previous experiences at a low level of generalisation to evaluate, interpret, or solve a current new problem (Aha, 1998a). CBR has also been called exemplar-based reasoning (Kibler and Aha, 1987), instance-based learning (Aha et al., 1991; Zhang et al., 1997), learning by examples (Sazanova et al., 1999), similarity-based reasoning (Hüllermeier, 2001), and memory-based reasoning (Stanfill and Waltz, 1986). CBR in a wider sense also involves derivational analogy (Carbonell, 1986, 1990), which recreates the lines of reasoning and the accompanying justifications for solving particular problems by using previous cases. Forms of analogical reasoning known under the terms analogue matching (Birks, 1993; Flower et al., 1997), structure mapping (Gentner, 1983; Falkenhainer et al., 1989), analogical mapping (Ellingsen, 1997), and analogy-based reasoning (Winston, 1986) solve new problems using the formalised similarity of past cases from a different domain, while typical case-based methods focus on matching strategies for cases of the same domain. CBR systems are oriented toward problem solving; they retrieve cases, while research in analogy has a broader scope, retrieving also generalised concepts. Analogy-based reasoning usually does not stress the difference between lazy and eager learning. CBR is close to pattern recognition; both use existing knowledge and make decisions by similarity. The difference lies mainly in the use of machine learning and in the wider application of expert systems in CBR.
Case-based reasoning has proved its strength in many weak-theory domains where the rules are not highly formalised, where a great number of single examples and case studies predominate over deduction, and where large databases of previous cases exist, e.g. medicine (Bareiss, 1989; Althoff et al., 1998; Jurisica et al., 1998; Seitz et al., 1999; Frize and Walker, 2000), customer support desks (Aha and Breslow, 1997; Lenz and Burkhard, 1997), control of technological processes (Heider, 1996; Suh et al., 1998; Verdenius and Broeze, 1999; Johnson, 2002), military command and control (Liao, 2000), natural resource planning and management (Moeur and Stage, 1995), landscape planning (Navinchandra, 1989), architecture and urban planning (Hua et al., 1994; Yeh and Shi, 2001), text processing and machine translation (Carl, 1997; Güvenir and Cicekli, 1998), speech recognition (Bradshaw, 1987; Jurafsky and Martin, 2000), image processing (Aha and Bankert, 1994; Perner, 1998, 1999, 2001; Ficet-Cauchard et al., 1999; Micarelli et al., 2000), project cost estimation (Finnie et al., 1997; Mair et al., 2000), financial risk management and claims prediction in insurance (Daengdej et al., 1999), web mining and web use mining (Kwon and Lee, 2003), and teaching (Redmond and Phillips, 1997). Case-based methods are well suited to image analysis tasks if a single class is represented by more than one cluster, or a class fills an irregular shape in the spectral space. This happens if objects under different illumination, or partly in shadow, have to be recognised from an image, or if some classes are naturally heterogeneous or in different stages of development. Mixed forest can have different stand structures and compositions; fields can be either recently ploughed or covered with opulent vegetation. Case is a widely accepted term for observations in similarity-based inductive learning. It is reasonable to distinguish between a new observation, called a case; an example in the training data, called an instance; and an abstraction or an individual typical instance within a CBR system, called an exemplar (Kibler and Aha, 1987; Bareiss, 1989; Wilson and Martinez, 2000). An exemplar can, in principle, be a detailed description of an observation, a feature vector of some selected characteristics, or a more or less generalised abstraction of instances.
In this case study, exemplars are feature vectors of 10 m × 10 m pixels. A case-based prediction system consists of exemplars; characteristics of exemplars, called features; rules for similarity estimation; feature weights; exemplar weights; retrieval indices and/or a preclassification; and validation and adaptation rules. Typically, the solutions have to be modified to compensate for the remaining differences between the known exemplars and a new situation.


The similarity of a new case to the exemplars in the case-base is estimated for each feature as a partial similarity. Partial similarities are summed, and can be weighted by the indicator value of every feature. Usually the similarities are normalised to the range from zero (absolutely different) to one (identical cases). The most similar exemplars are used for further problem solving. The most widely used techniques of exemplar retrieval in CBR systems are database queries searching for at least partially matching descriptions and attribute values within a given range, and variants of the k-NN and d-NN methods. More sophisticated algorithms classify instances into clusters and build decision trees, so combining RBR and CBR, or use fuzzy algorithms. The result of fuzzy reasoning is not only the most probable outcome but also a probability distribution of the predicted values. The main use of CBR has been the prediction of nominal classes. Continuous variables have mainly been modelled using statistical methods, although machine learning is also powerful in fitting features and exemplars for the prediction of all kinds of numeric variables (e.g. Zhang et al., 1997; Mair et al., 2000).

3.3. Lazy learning

The main machine learning tools in a CBR system are: feature selection and weighting; exemplar selection, weighting, and generalisation; and the learning of exemplar retrieval indices. Learning in a CBR system can be performed automatically or in interaction with a human expert. Both machine learning and statistical methods can be used for finding the best weights of exemplars and features; overviews can be found in Blum and Langley (1997), Dash and Liu (1997), Aha (1998b), Wilson and Martinez (2000), and Zhang et al. (2002). Optimisation of a CBR system always adds a certain amount of generalisation to a lazy-learning system; thus, the difference between lazy and eager learning is not strict.
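The weighted combination of partial similarities described above can be sketched as follows (illustrative code with hypothetical feature names and weights, not the similarity rules of the actual study): per-feature partial similarities are computed, multiplied by feature weights, and normalised to the range [0, 1].

```python
def weighted_similarity(case, exemplar, weights, ranges):
    """Combine per-feature partial similarities into one [0, 1] score.

    Continuous features: 1 - |difference| / range (floored at 0).
    Nominal features (strings here): 1 if equal, else 0.
    """
    total, weight_sum = 0.0, 0.0
    for p, w in weights.items():
        a, b = case[p], exemplar[p]
        if isinstance(a, str):                 # nominal feature
            partial = 1.0 if a == b else 0.0
        else:                                  # continuous feature
            partial = max(0.0, 1.0 - abs(a - b) / ranges[p])
        total += w * partial
        weight_sum += w
    return total / weight_sum                  # normalised to [0, 1]

case     = {"elevation": 120.0, "soil": "podzol"}
exemplar = {"elevation": 100.0, "soil": "podzol"}
s = weighted_similarity(case, exemplar,
                        {"elevation": 2.0, "soil": 1.0},
                        {"elevation": 200.0})
# elevation partial = 0.9, soil partial = 1.0
# -> (2 * 0.9 + 1 * 1.0) / 3, about 0.933
```

Feature weights here play the role of the indicator values mentioned above; exemplar weights could be applied analogously when the retrieved scores are aggregated.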
Learning in a CBR system takes place when a new exemplar is added or one of the exemplars is forgotten, when classes or the retrieval order are changed, when the similarity measure is changed, and when feature and/or exemplar weights are changed. Comparisons of different learning algorithms suggest that no single method is best for all tasks and data (Wettschereck et al., 1997; Kiang, 2003).


Finding the optimal weights is not always guaranteed, because an exhaustive search of the feature space is impracticable: there exist 2^n possible subsets of n attributes. A well-fitting set of feature weights is usually found by some variant of gradual hill climbing, which uses predetermined or random-length steps. Random samples of features and training instances can also be used in the optimisation of weights (Aha and Bankert, 1994; Maron and Moore, 1997). Partitioning the training instances into samples substantially accelerates learning, although it adds uncertainty to the results. The main issues of learning in a CBR system are briefly described in the next paragraphs.

3.3.1. Feature selection

The number of possible explanatory variables used in ecological predictions can be large, especially if various image and ground data are available; e.g. Milne et al. (1989) determined 37 landscape variables for each 3.6 ha cell in modelling the habitat distribution of white-tailed deer; Luoto et al. (2001) tested 24 environmental variables of 50 m × 50 m grid cells for predicting the distribution and abundance of the clouded apollo butterfly in SW Finland; Sachot (2002) used 43 variables for capercaillie and hazel grouse habitat modelling. In real-world situations, the relevance of attributes is often not known a priori, and the predictive value of a feature depends on which other features are also selected. Many candidate features can be irrelevant to the response variable or redundant with regard to other features. Reducing the number of irrelevant and redundant features not only improves the reliability of prediction but can also drastically reduce the time needed for calculations. There is no universally best method of feature selection, valid for all datasets and modelling tasks, but the basic approaches are forward selection, backward elimination, and the use of expert knowledge.
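Forward selection can be sketched as a greedy wrapper around a simple case-based classifier (an illustrative sketch, not the selection procedure of this study; the 1-NN criterion and data layout are assumptions): starting from the empty set, the feature whose addition most improves leave-one-out accuracy is added until no candidate improves the fit.

```python
# Greedy forward feature selection scored by leave-one-out accuracy
# of a 1-nearest-neighbour classifier restricted to the chosen subset.

def loo_accuracy(X, y, feats):
    """Leave-one-out accuracy of 1-NN using only the features in feats."""
    hits = 0
    for i in range(len(X)):
        best, best_d = None, float("inf")
        for j in range(len(X)):
            if i == j:
                continue
            d = sum((X[i][f] - X[j][f]) ** 2 for f in feats)
            if d < best_d:
                best, best_d = y[j], d
        hits += best == y[i]
    return hits / len(X)

def forward_select(X, y, n_features):
    selected, remaining = [], set(range(n_features))
    best_acc = 0.0
    while remaining:
        # Greedily add whichever feature improves LOO accuracy most.
        f, acc = max(((f, loo_accuracy(X, y, selected + [f]))
                      for f in remaining), key=lambda t: t[1])
        if acc <= best_acc:
            break                      # no candidate improves the fit
        selected.append(f)
        remaining.remove(f)
        best_acc = acc
    return selected, best_acc
```

Backward elimination is the mirror image: start with all features and greedily drop the one whose removal hurts the criterion least.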
The selection of relevant attributes and the elimination of irrelevant ones is one of the central problems in machine learning, because the number of attributes in the data sets given to machine learning is often large, e.g. up to 400 per case (Watson and Perera, 1997). If the data set grows and the phenomena are complex, then more exemplars and features are usually needed in order to obtain reliable predictions. Langley and Iba (1993) demonstrated that the number of training instances needed to reach a given accuracy grows exponentially with the number of irrelevant features. Hierarchical problem decomposition, context-guided exemplar retrieval and feature use, preclassification of cases, and local estimation of similarity can significantly speed up the performance of a predicting system.

3.3.2. Feature weighting

Feature weighting can both express different indicator values and account for the collinearity of features. Overviews on feature weighting can be found in Aha (1998b), Wettschereck et al. (1997), and Ling and Wang (1997). Feature weights can be tuned using linear programming (Zhang et al., 2002) or machine-learning methods, calculated as conditional probabilities, or assigned by an expert. Feature weights can be used both globally and locally, differing between features or between partitions of the sample space. In local weighting schemes, the feature space is partitioned and feature weights depend on feature values, or an individual similarity profile is assigned to each exemplar (Aha and Goldstone, 1992; Howe and Cardie, 1997; Hüllermeier, 2001). A local weighting scheme can be used in any combination with global weights, and combinations of local weights at different scales are possible.

3.3.3. Number of exemplars or kernel extent used for decisions

A CBR system can use: (1) only the most similar exemplar, (2) a certain number of similar exemplars (the k-NN method), (3) all exemplars above a fixed critical similarity (the d-NN method), (4) a fixed share of the most similar exemplars, (5) local parametric modelling, or (6) combinations of the above-mentioned criteria. Prediction using only the most similar exemplar is simple, but sensitive to noise caused by atypical exemplars or erroneous features. The reliability of both d-NN and k-NN prediction depends on the density of observations.
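The contrast between criteria (2) and (3) can be sketched on a list of already-scored exemplars (illustrative code and names; the retrieval rules of the actual study are a separate matter):

```python
# k-NN vs d-NN exemplar retrieval over (similarity, value) pairs.

def knn_retrieve(scored, k):
    """Always returns the k most similar exemplars."""
    return sorted(scored, key=lambda s: -s[0])[:k]

def dnn_retrieve(scored, d):
    """Returns exemplars above critical similarity d; may be empty."""
    return [s for s in scored if s[0] >= d]

scored = [(0.9, "forest"), (0.4, "bog"), (0.85, "forest")]
print(knn_retrieve(scored, 2))     # the two best, however dissimilar
print(dnn_retrieve(scored, 0.95))  # empty: no prediction possible
```

The sketch makes the trade-off discussed next concrete: k-NN always answers, even from poor matches, while d-NN can refuse to answer when the neighbourhood is empty.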
The k-NN method gives a prediction even in the absence of reasonably similar exemplars; the d-NN method cannot give a prediction if there are no exemplars within the extent of critical similarity. Okamoto and Yugami (1997) demonstrated that the optimal value of d for d-NN remained almost unchanged if the size of the case-base, and hence the exemplar density, changed. In contrast, the optimal k for k-NN significantly increased as the case-base size increased, and the prediction accuracy depended on the value of k. k is usually a parameter determined by the user of a k-NN method. Researchers have noted the need for automated optimisation of k (Wettschereck and Dietterich, 1995; Zhang et al., 1997; Wilson and Martinez, 2000). A locally variable kernel size, or local fitting of k, might yield a good fit if the data density, the level of noise in the data, or the curvature of the underlying function varies locally, although it leads to increased variance in the predictions and to the risk of over-fitting the data.

3.3.4. Similarity functions

Exemplars that are more similar are assumed to be more reliable predictors, but there is no universal similarity measure for all attributes and for both continuous and nominal variables. The most common methods of deriving the predicted value from similar exemplars are the weighted average (kernel regression) and locally weighted regression. If the number of features is large, locally weighted regression needs a much larger k than local averaging, at least as large as the number of features. Regression methods also become relatively inefficient as the number of categorical explanatory variables grows. An overview of similarity functions can be found in Wilson and Martinez (2000). A common way to treat continuous predictors is to standardise differences relative to the variability of every explanatory feature. The difference D(T_p, E_p) of non-nominal attributes (p) is usually standardised using a normalising parameter:

D(T_p, E_p) = |T_p − E_p| / V,    (1)

where T_p is the value of feature p of a training instance, E_p is the value of feature p of an exemplar, and V is the normalising parameter. A commonly used normalising value is the range of values of the attribute (Lam et al., 2002; Ross et al., 2002; Zhang et al., 2002). Values of the normalising parameter that are less than the range are appropriate if absolute non-similarity is assumed to occur more often than only in comparisons of the maximum and minimum values of the feature. Distance measures are converted to similarities by subtracting them from one; similarity is assigned a value of 0 if the estimated distance is greater than 1.
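Eq. (1) and the subsequent conversion to similarity can be written directly (an illustrative sketch; V is supplied by the caller, e.g. the attribute's range or a smaller value):

```python
def partial_similarity(t, e, v):
    """Partial similarity of a continuous feature following Eq. (1):
    distance |t - e| / v, converted to a similarity by subtracting it
    from one, and set to 0 when the distance exceeds 1."""
    d = abs(t - e) / v
    return 0.0 if d > 1.0 else 1.0 - d

# With V equal to the feature's range, similarity 0 occurs only for
# the extreme value pair; a smaller V makes absolute non-similarity
# occur more often.
print(partial_similarity(30.0, 10.0, 100.0))  # 0.8
print(partial_similarity(30.0, 10.0, 15.0))   # 0.0 (distance > 1)
```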


The simplest similarity estimation for nominal explanatory variables is to let the similarity equal one if the observations belong to the same class, and zero if the classes are different. It is possible to vary the influence of an attribute according to its indicator value later, by changing the feature weights. More sophisticated rules involve class similarity matrices, or include class frequencies in the similarity estimation before feature weighting; e.g. the difference between classes a and b of a nominal feature p can be calculated using the value difference metric (VDM) as the sum of squared differences of conditional probabilities (Stanfill and Waltz, 1986; Wettschereck and Aha, 1995):

VDM_p(a, b) = Σ_{c=1}^{C} (N(p, a, c)/N(p, a) − N(p, b, c)/N(p, b))²,    (2)

where N(p, a) is the number of occurrences of feature p with value a, N(p, a, c) is the number of occurrences of feature p with value a predicting class c, and C is the total number of predictable classes. Another similarity measure for nominal predictors and nominal response variables is mutual information, which measures the reduction of uncertainty in the class of the response variable given knowledge of the classes and class frequencies of the explanatory variables. If a feature provides no information about the classes of the predictable variable, the mutual information will be zero; if a feature completely determines the class, the mutual information will be proportional to the logarithm of the number of classes (Wettschereck and Dietterich, 1995).

3.3.5. Indexing the case-base

The calculation speed of a CBR system is especially sensitive to the number of exemplars and features. In the better cases, irrelevant exemplars and irrelevant features only make learning and prediction slower, but usually the reliability of prediction also drops, due to noise. If the number of exemplars in a case-base is larger than the number of features, as is usually the case, fast retrieval of relevant exemplars is crucial for speeding up case-based decision-making. Case-bases are optimised by indexing and preclassification, by generalisation of cases, by learning new and forgetting useless cases, by assigning individual similarity profiles to cases, and by weighting the cases. Indices represent the most relevant features for retrieving potentially relevant existing exemplars. Case-base indices can be hierarchical, forming index trees (Suh et al., 1998). An index that divides cases into subsets before any other estimation of similarity is a preclassification. Preclassification adds domain-specific knowledge to the case-base.

3.3.6. Learning new and forgetting useless exemplars

There are two principal directions in the search for the best predictive set of exemplars: incremental and decremental (Wilson and Martinez, 2000). An incremental process starts with an empty set and gradually adds exemplars from the training instances or creates abstract prototypes. A decremental process, also called filtering, starts with all instances and gradually filters out untypical and unreliable ones, or generalises instances to prototypes. Removing non-representative exemplars can reduce the storage requirement and the computational cost while maintaining or even improving the classification accuracy. Forgetting useless exemplars is essential because the cost of searching for matching exemplars quickly becomes prohibitive to the performance of a CBR system. Generally, filtering removes an exemplar if its removal does not reduce the prediction accuracy. The filtering techniques can be classified into three types: (1) methods retaining the most typical, central exemplars of classes; (2) methods relying on exemplars on the decision borders of class clusters; and (3) methods removing exemplars on class boundaries and treating all others as representative (Wilson and Martinez, 2000). Methods that check classification quality according to the nearest neighbours of exemplars discard exemplars on class boundaries, because outliers are seldom correctly classified by their neighbours; e.g. a modified k-NN algorithm proposed by Wilson (1972) removes exemplars misclassified by most of their k nearest neighbours.
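Wilson-style editing can be sketched as follows (an illustrative implementation; the list-of-tuples data layout is an assumption): each exemplar is classified by the majority vote of its k nearest neighbours, and misclassified exemplars are discarded.

```python
# Decremental filtering in the spirit of Wilson (1972): drop exemplars
# misclassified by the majority of their k nearest neighbours.

from collections import Counter

def wilson_filter(cases, k=3):
    """cases: list of (feature_vector, label). Returns retained cases."""
    def neighbour_labels(i):
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(cases[i][0], c[0])), c[1])
            for j, c in enumerate(cases) if j != i)
        return [label for _, label in dists[:k]]

    keep = []
    for i, (f, label) in enumerate(cases):
        majority = Counter(neighbour_labels(i)).most_common(1)[0][0]
        if majority == label:     # noisy/boundary exemplars are dropped
            keep.append((f, label))
    return keep
```

For example, a "b"-labelled exemplar sitting deep inside a cluster of "a" exemplars is outvoted by its neighbours and filtered out, which is exactly why this family of methods tends to retain central rather than boundary exemplars.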
The most typical central representatives of classes are maintained if the exemplars that are correctly classified by their neighbours are preferred. However, retaining typical exemplars in class centres will probably cause misclassification in cases of heterogeneous classes, heteroscedastic attributes, and classes of different shape. Methods relying on border exemplars remove an exemplar if other instances are correctly classified without it. Boundary exemplars can describe extremely complex concept spaces but are sensitive to noise and use more


K. Remm / Ecological Modelling 177 (2004) 259–281

exemplars than CBR systems that rely on central exemplars.

A case-based prediction system can contain observational, constructed, and/or generalised exemplars. If instances of some important feature domains are not available, feature vectors for these can be created using expert knowledge or by generalisation, and can be added to the set of observed cases. In general, generalised prototype generation is more powerful for data reduction and abstraction than filtering, especially if truly representative exemplars are not present in the case-base. On the other hand, exemplar reduction by filtering can be more effective in representing the internal, unknown variability of classes. Instance filtering can be performed before, after, or in parallel with feature weighting.

3.3.7. Weighting of exemplars

The underlying assumption of exemplar weighting is that some exemplars are more important, more reliable, or more generalised than others. A feature's weight is the same for all cases if the weighting is global; if every case can have a different weight, the weighting is called local. Discussions on exemplar weighting can be found in Bareiss (1989), Atkeson et al. (1997), Wettschereck et al. (1997), and Wilson and Martinez (2000). Bareiss (1989) calculated prototypicality as a partial ordering of exemplars within a category; the exemplars with higher ratings were recalled first during classification of a new observation, and the most prototypical exemplars had the greatest similarity to other instances within a category. Cost and Salzberg (1993), and also Seitz et al. (1999), assigned a weight to an exemplar according to its frequency of use for correct predictions. Different influences of exemplars can also be achieved by allowing the sums of exemplar feature weights to differ (Zhang et al., 2002).

3.3.8. Adaptation

If the attributes of even the most similar existing exemplar do not exactly match the new situation, the case has to be adapted.
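The frequency-of-correct-use weighting attributed above to Cost and Salzberg (1993) can be sketched as follows. The usage-log layout and the helper names are hypothetical simplifications, not the cited authors' implementation.

```python
import math

def reliability_weight(uses, correct_uses):
    """In the spirit of Cost and Salzberg (1993): weight = uses / correct
    uses, so frequently-wrong exemplars get weights above one and, when
    the weight multiplies the distance, are pushed away from queries."""
    return uses / correct_uses if correct_uses else math.inf

def weighted_distance(x, exemplar, w_exemplar):
    """Euclidean distance inflated by the exemplar's reliability weight."""
    return math.dist(x, exemplar) * w_exemplar
```

An exemplar that was always used correctly keeps weight 1.0 and its raw distance; one that was correct only half the time has its distance doubled, so it is selected less often.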
Several adaptation methods exist; the simplest technique is not to adapt at all (null adaptation). Parameterised adaptation presumes existing generalisations that are able to modify the solution according to differences in case parameters. Adaptation by respecialisation tries to find another, better match for the part of the solution that does not match. Critical adaptation uses expert knowledge to assess combinations of features that may cause problems in a solution proposed by machine learning.

4. Case study

4.1. Location and data

Case-based predictions and machine learning methods were tested in a case study based on data from the Otepää Upland in Southeast Estonia. The Otepää Upland is an end-moraine accumulative insular height with irregular hilly moraine relief. Large hills are characteristic of the central part of the upland, while irregular relief of medium and small-sized hills characterises its slopes. Most of the study area lies within the Otepää Nature Park. Habitat mapping and forest composition modelling in this study area are complicated due to: (1) the high topographical variability of the terrain in Otepää Nature Park; (2) the fact that the extremely mosaic land use pattern yields a large number of mixed pixels in the case of low-resolution data; (3) the diverse and gradual forestation of abandoned agricultural lands; (4) the spatially and thematically continuous variability of forest composition from pure coniferous through various mixed stands to pure broad-leaved stands; (5) the high variability of tree stand coverage as a result of natural forestation and of forest management through thinning.

4.1.1. Image data

All nine channels of the Landsat 7 Enhanced Thematic Mapper Plus (ETM+) image from May 16, 2000 were used in this investigation as potential explanatory variables. May is the month of foliation of deciduous trees in the study area; in the middle of May the leaves are usually about half of their full size. The 30 m × 30 m pixels were resampled to the Lambert Conformal Conic projection of the Estonian 1:10,000 base map coordinate system and to a 10 m × 10 m pixel size, because this coordinate system and pixel size are used for the other data sources prepared for this study. Greyscale orthophotos from the year 1994 and colour digital orthophotos from the year 1999 were


also used. Colour orthophotos were available for only about one third of the study area. Pixel size of the orthophotos varies according to map sheet, in a range of 0.9–1.2 m. Characteristics of orthophotos were calculated within both 10 m and 20–30 m radius zones; the characteristics were: local mean intensity, local standard deviation, and local spatial autocorrelation of pixel values measured by Moran's I; from colour photos, also the hue and intensity of the red, green and blue tones. Colour photos were also converted to greyscale in order to calculate the greyscale characteristics. All image data layers were resampled or calculated to the same grid of 10 m × 10 m pixels.

4.1.2. Map data

An elevation model of the study area was created from 2 m contour lines and spot elevations on 1:10,000 topographical paper maps of the area using the Modular GIS Environment software of Intergraph Inc. Relative elevation, as the difference of the focal elevation compared to the mean at 100 m distance, and slope angle in degrees were calculated from the elevation model for every 10 m × 10 m pixel of the study area. Thousands of different soil descriptions given in the database of the Estonian 1:10,000 vector-format digital soil map were generalised into 20 types of soil parent material and 50 soil types. Of these generalised types, 15 parent material types and 29 soil types are represented in the study area. Eleven land cover classes were distinguished and rasterised to the same pixel size from the 1:10,000 Estonian digital base map: (1) woodland, (2) field, (3) meadow, (4) other open, (5) greenery within settlements, parks, graveyards, (6) built-up, yards, orchards, (7) roads, artificial and destroyed natural surfaces, (8) scrub, (9) water, (10) fen, (11) bog. Codes of base-map sheets were included in the set of possible explanatory variables to characterise the influence of differences in the completion time and technical properties of orthophotos and map sheets.
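The local spatial autocorrelation characteristic mentioned above can be illustrated with a generic Moran's I over a circular zone. This is a sketch with binary contiguity weights, not necessarily the weighting scheme used for the orthophoto zones.

```python
import numpy as np

def local_morans_i(values, coords, radius):
    """Moran's I for pixel values within a zone; binary weights
    (1 if two pixels are within `radius` of each other, else 0).
    A generic sketch, not the paper's exact implementation."""
    v = np.asarray(values, dtype=float)
    xy = np.asarray(coords, dtype=float)
    z = v - v.mean()                       # deviations from the zone mean
    n = len(v)
    num, wsum = 0.0, 0.0
    for i in range(n):
        for j in range(n):
            if i != j and np.linalg.norm(xy[i] - xy[j]) <= radius:
                num += z[i] * z[j]
                wsum += 1.0
    denom = (z ** 2).sum()
    return (n / wsum) * (num / denom) if wsum and denom else 0.0
```

An alternating pattern of values yields a strongly negative I, while a smooth gradient yields a positive I, which is why the statistic summarises image texture within the zone.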
Cartesian coordinates of pixels were also included among the pixel characteristics in order to check the influence of spatial autocorrelation of variables. Merging the base map classes "meadow" and "other open", and distinguishing between forest on peat soil and forest on mineral soil, gave 11 preclassification classes. In calculating case-based predictions, the most similar field observations to a


predictable pixel were searched for only within the same class of preclassification. Preclassification enabled the incorporation of the spatially most precise information on land cover borders from the base map.

4.1.3. Field investigation

The reliability of any empirical prediction model depends on how representative the training data and the evaluation data are, a fact that is often ignored in image interpretation. Therefore, a set of locations consisting of the centres of 1000 random pixels was created for this study. Three preconditions were followed during the process of generating random locations: (1) only locations where the neighbouring pixels were similar to the focal one were accepted, in order to avoid significant prediction errors due to small positional errors; (2) over-representation of land cover classes was blocked by a thematic stratification from a previous habitat map; (3) the location of random points in neighbouring pixels was not accepted. Habitat type, stand coverage, and the coverage of the main tree species were estimated at every location in June–August 2002. Coverage of trees was estimated according to the presumed view from above; that is, the coverage of undergrowth trees and bushes was recorded only to the extent that they were not covered by the upper layers of the canopy. The detailed habitat descriptions were converted to the European Nature Information System (EUNIS) habitat classification (Davies and Moss, 2002). The EUNIS classification system has been worked out to a very detailed level; it contains more than 1500 classification units.
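The three preconditions for generating random locations can be sketched as a rejection-sampling loop. The neighbourhood and stratification checks below are simplified stand-ins (rook neighbours, a flat per-class cap), not the exact field protocol.

```python
import random

def sample_locations(class_map, n, max_per_class, rng=None):
    """Rejection sampling of pixel centres under simplified versions of the
    three preconditions: (1) the four rook neighbours share the focal
    pixel's class, (2) no class exceeds max_per_class (thematic
    stratification), (3) no two accepted pixels are adjacent.
    Hypothetical sketch, not the paper's procedure."""
    rng = rng or random.Random(0)
    rows, cols = len(class_map), len(class_map[0])
    chosen, per_class = [], {}
    attempts = 0
    while len(chosen) < n and attempts < 100000:
        attempts += 1
        r, c = rng.randrange(1, rows - 1), rng.randrange(1, cols - 1)
        cls = class_map[r][c]
        if any(class_map[r + dr][c + dc] != cls
               for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1))):
            continue                         # precondition 1: homogeneous spot
        if per_class.get(cls, 0) >= max_per_class:
            continue                         # precondition 2: stratification
        if any(abs(r - r2) <= 1 and abs(c - c2) <= 1 for r2, c2 in chosen):
            continue                         # precondition 3: no neighbours
        chosen.append((r, c))
        per_class[cls] = per_class.get(cls, 0) + 1
    return chosen
```

With a cap on each land cover class, common classes stop accumulating points once saturated, which is the stratification effect described above.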
Fourteen classes of the second level of EUNIS were distinguished in this study: (1) surface standing waters, (2) raised and blanket bogs, (3) valley mires, poor fens and transitional mires, (4) base-rich fens, (5) mesic grasslands, (6) seasonally wet and wet grasslands, (7) riverine and fen scrubs, (8) broadleaved deciduous woodland, (9) coniferous woodland, (10) mixed deciduous and coniferous woodland, (11) lines of trees, small anthropogenic woodlands, recently felled woodland, early-stage woodland and coppice, (12) arable land and market gardens, (13) domestic gardens of villages and urban peripheries, (14) constructed, industrial and other artificial habitats. The other variables derived from field recordings besides EUNIS classes were: (1) coverage of forest


stand, (2) coverage of coniferous trees, coverage by tree species: (3) Scots pine (Pinus sylvestris), (4) Norway spruce (Picea abies), (5) birch (Betula pendula + B. pubescens), (6) grey alder (Alnus incana), (7) aspen (Populus tremula), (8) oak (Quercus robur), (9) hazel (Corylus avellana), (10) willows (Salix spp.), (11) stand composition in the form of the stand formula and (12) as a list of the coverages of the main tree species, (13) the dominant tree species (+ class mixed and class open), (14) number of tree species recorded in the stand formula, (15) dominance as the sum of squares of tree shares in stand coverage, and (16) presence/absence of oak in the stand formula.

Time intervals between the field observations and the different explanatory data are sources of noise for any prediction system. To reduce errors caused by the time shift of the source data, field observations in places where the landscape had obviously been changed after the acquisition time of the Landsat scene, due to new constructions, clear-cuttings, or other abrupt land use changes, were excluded from further analysis. Nine hundred thirty-four field observations remained for further analysis. For prediction of forest composition and coverage of tree species, only relatively natural areas consisting of forests, scrubs, fens, and bogs were used. Five hundred sixty-four field observations were located within this part of the study area.

4.2. Case study methods

4.2.1. Software

A case-based prediction system has been programmed in Microsoft Visual Studio .NET at the Institute of Geography of the University of Tartu. It can be used for learning the set of weights of features and exemplars, and for predicting different types of variables: continuous, multinomial and binomial, and complex characteristics, e.g. the stand formula of a forest. The predictors can include both continuous and categorical variables at the same time.
The user can choose between several machine learning methods, types of deviation handling (square or linear), and the normalising parameter (mean error or standard deviation; see formula (3)).

The results of the case-based machine learning predictions were compared to the results of statistical modelling using Statistica 6.0 (StatSoft Inc.). Classification models for predicting habitat classes and the dominant tree species were formed using generalised linear discriminant analysis. Presence/absence of oak was modelled using generalised logistic regression; the other variables were modelled using generalised linear models. A Poisson distribution link was used in prediction of the count of tree species. Generalised models were used because these permit the combined use of both continuous and nominal predictors. Complex response variables (stand composition by coverage and stand formula) were not statistically modelled.

4.2.2. Validation

Goodness-of-fit of lazy learning predictions was estimated by leave-one-out cross-validation (LOOC). LOOC means that the predicted value for every training instance is calculated from all other instances, leaving this instance out. The statistical prediction methods were evaluated by training fit that included all data, and by cross-validation. The cross-validation fit of statistical models was calculated using 20 equal-size random blocks: 19 blocks for model building and the remaining block for the calculation of validation errors. The predicted and field-estimated values were compared as follows: by the index of classification agreement (kappa) if the variable is multinomial (habitat class, dominant tree); by the half-sum of the mean true positive and the mean true negative predictions of probability of classes if the predictions are probabilities of a binomial variable (presence/absence of oak); and by the matching share of values if more than one variable was predicted, as in the case of stand composition or stand formula. Other types of numerical predictions were treated as continuous, and the share of variance described by the prediction (R2) was the objective function.
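The LOOC procedure with kappa as the objective function can be sketched as follows; the 1-nearest-neighbour predictor is a placeholder stand-in for the weighted case-based predictor, chosen only to keep the example self-contained.

```python
import numpy as np

def cohen_kappa(a, b):
    """Index of classification agreement (kappa) between two label lists."""
    a, b = np.asarray(a), np.asarray(b)
    labels = np.union1d(a, b)
    po = np.mean(a == b)                      # observed agreement
    pe = sum(np.mean(a == L) * np.mean(b == L) for L in labels)  # chance
    return (po - pe) / (1 - pe) if pe < 1 else 1.0

def loo_kappa(X, y, predict_one):
    """Leave-one-out cross-validation: predict each instance from all
    the others, then score the agreement with kappa."""
    preds = []
    for i in range(len(X)):
        mask = np.arange(len(X)) != i         # leave instance i out
        preds.append(predict_one(X[mask], y[mask], X[i]))
    return cohen_kappa(y, preds)

def nn_predict(X_train, y_train, x):
    """1-nearest-neighbour stand-in for the case-based predictor."""
    d = np.linalg.norm(X_train - x, axis=1)
    return y_train[np.argmin(d)]
```

Because the expected value of kappa on random data is zero and one indicates perfect agreement, this score is directly comparable across the nominal response variables.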
Since the expected value of kappa and R2 from random data is zero and the value in the case of absolute fit is one, the average truth of predictions of oak presence/absence probability and the similarities of complex variables were transformed to the same range of values, to make the goodness-of-fit statistics of different types of variables comparable.

4.2.3. Estimation of similarity

The method of similarity measurement between cases depends on whether the response variable is nominal or numerical, and whether the feature is


nominal or numerical. The value difference metric (formula (2)) was used for calculating the difference between nominal explanatory classes if the response variable was also nominal. If the predictable variable was continuous and the explanatory feature nominal, then the partial similarity (Sp) was set depending on the expected rate of coincidence in random combinations of classes: Sp = 1 − F^2 if the feature classes match, and Sp = Fa·Fb if the feature classes differ, where F is the relative frequency of a nominal value a or b. This gives a higher weight to the matching of rare nominal values; the influence of a nominal attribute on the total similarity between cases is reduced if the matching nominal values are common in the case-base. If all instances had the same value of a nominal attribute, the partial similarity would become zero, although the codes of the classes match. The frequency of classes thus influences the similarity measure according to the expected rate of coincidence in random combinations of classes.

In the case of a continuous feature (p), the standardised difference [D(Tp, Ep)] between its values (Tp and Ep) for an exemplar (E) and a training instance (T) was calculated using formula (1) modified as:

D(Tp, Ep) = |Tp − Ep| / (Vp,precl · wE · wp),    (3)

where Vp,precl is a normalisation parameter of feature p within preclassification class precl, wE is the weight of exemplar E, and wp is the weight of feature p. The partial similarity (Sp) between an exemplar and a training instance regarding a feature p received the value 1 − D(Tp, Ep) if D(Tp, Ep) < 1; otherwise Sp = 0. Vp,precl was set equal to two standard deviations of the predictor p within the preclass precl, because predictions of habitat classes with V = 2 were slightly better than with V = 1 or V = 3 standard deviations. The extent of the kernel is larger around exemplars with higher weights and along feature axes with higher weights. The weights of features also regulate the relative effect of partial similarities on the total similarity between an exemplar and an instance. The total similarity was calculated as a weighted average of partial similarities. Only features whose values were known both for the instance and for an exemplar were used in similarity calculations. For example, if at least one of the pixels in a similarity calculation is from a part of the study area for which


there is no colour orthophoto, then the characteristics of colour photos do not affect the total similarity.

A predefined maximum amount of total similarity was used as a critical value, the crossing of which stopped the search for further similar exemplars. If exemplars of this similarity were found in the case-base, the similarities of the selected exemplars were summed. If the total amount of similarity of the selected exemplars was still below the critical value, then the similarity level was reduced by 1% and the search was reiterated. Machine learning searched for the best value of the critical total similarity in parallel with learning the best feature weights. The starting value of the critical total similarity was one, except for the prediction of habitat classes, because trials during the debugging of the machine-learning software revealed that values near 4 gave the best fits for habitat mapping.

The predicted value was derived from the selected most similar exemplars as the mean value weighted by similarities if the response variable was treated as continuous. In the case of a classification task, the pixel was classified into the class with the largest sum of similarities among the selected most similar exemplars.

4.2.4. Machine learning methods

Four methods of machine learning, named LM1, LM7, LM13, and LM21, were compared in this investigation. LM21 starts with instance filtering. All instances and all features have a starting weight equal to one. Only instances whose removal does not improve the goodness-of-fit of the prediction according to LOOC are maintained during filtering. Most instances were excluded from the prediction as useless or noisy during this stage of learning. The next step of LM21 is the changing of feature weights, one at a time, by a random value within −0.5 to +0.5 of the existing weight, in 50 iterations.
The change that gives the maximum improvement of the goodness-of-fit of the prediction (or, if none of the changes improves the prediction, the minimum reduction of fit) is kept for the next iteration. Every time a feature weight is changed, all feature weights are standardised to keep the mean value of the weights equal to one; this holds in all four learning methods used in the methodological comparisons of this study.

LM7 uses instance filtering and feature weighting in the reverse order compared to LM21. The first step of


LM7 is stepwise inclusion of features with proportionally decreasing weights. The weight of the best predictive feature is 1 when included, and that of the 20th best feature, 0.05. The weights are re-standardised after every change; therefore, the weights of predictors included at the beginning increase above one. The best set of feature weights is used in the second step, which is instance filtering. Here, the change in prediction quality was tested by tentative one-by-one omission of exemplars. If the presence of an exemplar in the case-base did not improve prediction quality, the exemplar was excluded from further use.

LM13 is an advanced modification of LM7. The enhancements are: (1) additional refinement of the feature weights during 30 iterations, and (2) exemplar weighting instead of instance filtering. Useless instances are not simply excluded during weighting; instead, individual real-valued weights are given to every exemplar as an assessment of its best influence on the prediction. A zero weight means exclusion of the instance from the set of exemplars. Exemplar weighting starts at weights equal to one for all training instances. Then the weights of all single exemplars are changed one by one, first by 1, then by 0.5, and then by 0.25.

LM1 includes iterative random sampling. The first stage of the process is stepwise inclusion of features using a new random sample of instances at every step. The weight of features decreases proportionally according to the order of addition. The second step is the search for better values of feature weights in gradually enlarging random samples, from 0.3 of the volume of the training set to the full set of instances, within 18 iterations. The last step is instance filtering, as in processes LM7 and LM21. The use of samples enables the learning process to be sped up and allows feature combinations to be reached that cannot be found in deterministic all-data processes.
In principle, a random process can even reach a fitness peak surrounded by a circular valley. The LM1 process was repeated 25 times with every response variable. The best set of exemplars and features out of the 25 results was further improved by exemplar weighting. The best weights were stored separately for all prediction tasks and learning methods. These different weight sets were used in combination with the same case-base and rasterised data layers for creating prediction maps.
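The similarity calculation of Section 4.2.3 and the derivation of a prediction from the most similar exemplars can be sketched roughly as follows. The data layout, the fixed normalisation values V, and the omission of the threshold-lowering loop are simplifications, not the paper's implementation.

```python
import numpy as np

def partial_similarity(t, e, V, w_e, w_p):
    """Formula (3): the standardised difference D, scaled by the
    normalisation parameter V and by exemplar and feature weights;
    similarity is 1 - D where D < 1, else 0 (a finite kernel)."""
    D = abs(t - e) / (V * w_e * w_p)
    return max(0.0, 1.0 - D)

def total_similarity(instance, exemplar, V, w_e, w_feats):
    """Weighted average of partial similarities over continuous features."""
    sims = [partial_similarity(t, e, V[p], w_e, w_feats[p])
            for p, (t, e) in enumerate(zip(instance, exemplar))]
    return float(np.average(sims, weights=w_feats))

def predict_continuous(instance, exemplars, values, V, w_ex, w_feats,
                       critical=1.0):
    """Similarity-weighted mean of the response values of the most similar
    exemplars; exemplars are gathered until their summed similarity
    reaches the critical amount."""
    sims = np.array([total_similarity(instance, ex, V, w, w_feats)
                     for ex, w in zip(exemplars, w_ex)])
    order = np.argsort(sims)[::-1]           # most similar first
    picked, acc = [], 0.0
    for i in order:
        if acc >= critical or sims[i] <= 0:
            break
        picked.append(i)
        acc += sims[i]
    if not picked:
        return float(np.mean(values))        # fall back to the global mean
    s = sims[picked]
    return float(np.dot(s, np.asarray(values, float)[picked]) / s.sum())
```

Raising an exemplar weight w_e enlarges the kernel around that exemplar (D shrinks, so it stays similar over a wider range), which is how exemplar and feature weights jointly regulate both influence and kernel extent.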

5. Results and discussion

5.1. Comparison of learning methods

The learning methods LM21 and LM7 differ in the order of applying feature weighting and instance filtering. LM21 (filtering-first) reaches a relatively high goodness-of-fit level after instance filtering; feature weighting takes about the same time but does not improve the fit remarkably (Fig. 1). Presumably, the learning curves depend on the number of features, on the number of exemplars, and on the ratio of the two. As long as the number of exemplars is large, enough possible combinations of exemplars exist to select a well-predicting set. The effect of feature weighting and exemplar selection on the goodness-of-fit was better balanced in learning method LM7. The final best-fit levels of LM7 and LM21 do not differ significantly. The learning results of LM21 are more stable than those of LM7, although LM21 is more time-consuming. Both methods only filter exemplars into "useful" and "useless" and do not weight exemplars. If learning time is not limiting, then exemplar weights can improve the predictions remarkably: learning methods without exemplar weighting never reached a better goodness-of-fit for the same response variable than methods involving exemplar weighting (Table 1).

Learning method LM13 reached the best results in the case of 10 dependent variables out of 17. If the outcome was not the best, it was still among the best for a particular response variable. The success of the method may be attributable to a relatively well-balanced effort on feature selection, feature weighting, and exemplar weighting, at least for this data-set (Fig. 2).

The time taken for learning depends on the number of instances and features selected to a feature set before instance filtering, or on the number of selected exemplars if instance filtering is the first stage of learning. Mostly, method LM1 is fast enough, making its multiple repetition realistic. Exemplar weighting, which is not a part of the LM1 process, is usually time-consuming, and therefore the improvement of fit through exemplar weighting was calculated using only the best result out of 25 repetitions of the LM1 process (Fig. 3). Method LM1, which starts from random samples, reached the best fit after 25 repetitions and additional exemplar weighting for seven response variables (Table 1).

Fig. 1. Learning curves of methods LM7 and LM21 for coverage of Pinus sylvestris (R2 [%] against learning time [min]). SW-in: stepwise addition of features, CF: instance filtering, FW: feature weighting.

Fig. 2. Learning curve of method LM13 for coverage of Pinus sylvestris (R2 [%] against learning time [min]). SW-in: stepwise addition of features, FW: feature weighting, EW: exemplar weighting.

Fig. 3. Learning curve of method LM1 for coverage of Pinus sylvestris (R2 [%] against learning time [min]). SW-in: stepwise addition of features in a random sample (squares), FW: sample enlarging and feature weighting (triangles), CF: instance filtering, EW: exemplar weighting.

Table 1. Goodness-of-fit of predictions expressed as %. [Table body not reproduced; the numeric cells were scattered in extraction. Rows (response variables, grouped by objective function): nominal (kappa): EUNIS habitats, dominant tree; binomial (true positive + true negative − 1): Quercus presence/absence; count (R2): number of tree species recorded; continuous (R2): stand cover, coniferous cover, Pinus cover, Picea cover, Betula cover, Alnus cover, Populus cover, Corylus cover, Quercus cover, Salix cover, dominance; complex (matching share): coverage of all tree species including open share, stand formula. Columns: generalised regression or LDA (training fit, CV fit); lazy learning: 25 × LM1 (SW-in, RS → SE, FW → CF; fit, time), + EW of the best (fit), LM7 (SW-in → CF; fit, time), LM13 (SW-in → FW → EW; fit, time), LM21 (CF → FW; fit, time).] Fit of machine learning methods is leave-one-out cross-validation fit; fit of statistical models is estimated from cross-validation (CV) using 20 equal-size random samples; fit of LM1 is given as mean (minimum–maximum); learning time is in minutes (2 GHz processor); the best fit is in bold. Generalised linear discriminant analysis (LDA) is used in statistical models of categorical data, logistic regression for binomial data, generalised linear regression (GLR) with Poisson link for count data and with truncated normal link for coverage. SW-in: stepwise addition of features, RS: random samples, SE: sample enlarging, CF: instance filtering, EW: exemplar weighting, FW: feature weighting. LM1, LM7, LM13, LM21 denote machine learning methods (descriptions in text). a: not modelled.

5.2. Predictions

The predictive ability of statistical models was inferior to the results of lazy learning for all variables. The best match of prediction and field observation according to LOOC-validation was in the case of the habitats (R2 fit 85%) and the total coverage of coniferous trees (80%). It was possible to achieve a better prediction for the total coverage of coniferous trees than for the coverage of the main coniferous trees separately (P. sylvestris 72%, P. abies 73%), and a clearly better prediction than for the coverage of any deciduous tree

species or the stand total coverage (Table 1). Both the Landsat image and the orthophotos used in this study were taken in spring, when the leaves of deciduous trees are still small and their coverage and hue depend on local microclimatic conditions; the estimation of deciduous coverage from these images cannot be expected to be reliable, even if it is supported by soil and topographical data. Coverage of trees occurring sparsely (P. tremula, Salix spp.) or mainly in the undergrowth (C. avellana) cannot be reliably estimated from images, nor does the abundance of these species strictly coincide with any soil or topographical property of a site. Other authors have expressed the belief that modelling habitat suitability for a rare species by the prediction of presence/absence is generally more reliable than modelling its abundance (Bonn and Schröder, 2001; Frescino et al., 2001). Abundance of species usually depends on inherent population dynamics and is less stable than the presence/absence measure. An eighty percent fit of a generalised additive model has also been reported by Frescino et al. (2001) in predicting the presence of lodgepole pine (Pinus contorta) in the Uinta Mountains, USA. The predictions of tree basal area and shrub coverage by the same authors were not as precise.

The predictions of complex variables that depend on several species were not good. The best predictive weights produced predictions describing less than 60% of the variability of the response variables: stand
composition, stand formula, number of tree species recorded, and the dominant tree species. The prediction of dominant tree species was the best (58%) among the multi-species variables and the prediction of the number of tree species recorded in the stand formula was the least precise (41%). Different predictors are relevant in recognising different trees from images, and the best way to glean forest composition from image and map data has probably not yet been found. The upper limit of the maximum fit of a model is determined by the precision of measurements and by the error distribution of explanatory variables. The share of tree species was estimated visually and recorded in the stand formula on a 10-point scale. Therefore, the expected mean error in field estimates of coverage is probably about 10%. Also, there is always some amount of positional and classification imprecision in all map data, and discrepancies between the existing map data and field observations, which reflect differences in time, methodology, and the scale of analysis, are usual (Wu and Smiens, 2000). Vector format soil maps are a characteristic example of a continuous variable that is generalised into thematic classes and segmented into polygons, the boundaries of which usually cannot be followed in nature. The use of nominal characteristics, like soil type, adds some amount of confusion to the predictions if the variable is naturally continuous. Image data is vulnerable to various spatial distortions caused, e.g. by topography, by variations in atmospheric conditions, by the properties of sensors, and by effects of image pre-processing. Suspicions may arise, considering the power of machine learning and the risk of over fitting, that certain goodness-of-fit, at least on the level of less successful machine learning methods, could be obtained also from random data. I tested this by randomly replacing the observed values of response variables between observations 20 times. 
The randomised case-base was optimised using method LM13. Randomisations did not give a prediction fit above the best fit obtained from real observations and, in most cases, learning from randomised data yielded an apparent explanation near zero (Fig. 4). Machine learning from randomised data was more successful in the case of nominal dependent variables. Only the learning of randomised oak coverage and presence/absence data was sometimes more successful than the least successful method of machine learning from the observed data. Therefore, the goodness levels of case-based predictions reported in Table 1 are unlikely to be a result of over-fitting of irrelevant features and of assigning negligible weights to components of the case-based system.

Fig. 4. Relative frequency of goodness-of-fit values from randomised data obtained by method LM13.

5.3. Feature weights

Features do not predict directly in case-based calculations. The weights of features merely decide which exemplars are selected to be among the most similar ones and which differences in features are more relevant in deciding the similarity of cases. For example, if map sheet code is used as a feature, then exemplars from the same map sheet as a predictable pixel are better predictors than exemplars from other sheets. This may be explained by spatial autocorrelation of phenomena, by varying technical characteristics, and by differences in the reliability of source data from different map sheets. The differing predictive value of instances, as well as the final weights of features in the configurations selected by machine learning, is a question for further analysis. Experimenting with machine learning methods indicated that quite different sets of features and exemplars can sometimes result in nearly equally good predictions. The task of this study is only to demonstrate the possibility of interpreting the weights as indicator values of the features.

Although the 17 response variables used in this study are not a balanced and representative set of possible dependent variables, some tendencies do appear. The average weights of all explanatory variables in the best predictive sets, standardised so that the mean value = 1, range from 0.37 to 2.70. All features occurred in at least one of the best predictive sets. The most valuable predictor was land-cover category from the base map (Fig. 5A). Coarser-scale information from Landsat TM channels was approximately as valuable as the detailed radiance intensity of colour separations from colour orthophotos. Grey intensity from Landsat TM channel 8 was preferred to the intensity of grey on orthophotos. One reason for the higher value of satellite information is the relatively stable and comparable reliability of reflectance values, at least within one scene. The relatively low indicator value of information from greyscale orthophotos can, in this case, be attributed to the varying quality of photos from the 1990s. Both darkness and contrast vary between neighbouring orthophoto sheets, and within a sheet, partly because one orthophoto sheet is usually a compilation of different aerial photo images.

Fig. 5. Weights of features standardised to have mean value = 1 in the best sets of prediction: (A) average weights of 17 variables; (B) habitat classes; (C) total coverage of coniferous trees; (D) presence/absence of oak trees in the stand formula. n: mean value of neighbouring pixels; E–W: coordinate in east–west direction; N–S: coordinate in north–south direction.
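The role of feature weights in similarity calculation can be sketched as a weighted distance over mixed numeric and nominal features. This is an illustrative sketch, not the implementation used in the study; the feature names and weight values below are hypothetical.

```python
import math

def weighted_distance(case_a, case_b, weights, nominal):
    """Feature-weighted distance over mixed numeric/nominal features.

    Numeric features are assumed to be pre-standardised; nominal
    features (e.g. a land-cover class or map sheet code) contribute a
    simple 0/1 overlap term.  Feature weights regulate how strongly
    each difference counts towards dissimilarity, and thereby which
    exemplars end up among the most similar ones.
    """
    total = 0.0
    for f, w in weights.items():
        if f in nominal:
            d = 0.0 if case_a[f] == case_b[f] else 1.0
        else:
            d = case_a[f] - case_b[f]
        total += w * d * d
    return math.sqrt(total)

# Hypothetical exemplar and target pixel: two standardised spectral
# values plus a nominal land-cover category from the base map.
exemplar = {"tm2": 0.4, "tm5": 1.1, "landcover": "mesic grassland"}
target = {"tm2": 0.5, "tm5": 0.9, "landcover": "mesic grassland"}
weights = {"tm2": 1.0, "tm5": 1.0, "landcover": 2.7}  # illustrative values
print(round(weighted_distance(target, exemplar, weights, {"landcover"}), 3))
```

A differing land-cover class would add the full weight (here 2.7) to the squared distance, pushing exemplars of other classes out of the set of nearest exemplars.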

Explanatory variables of both the same pixel and of neighbouring pixels were chosen for the best predictive feature sets. The coexistence of local and neighbourhood characteristics can be related to the problem of finding the proper scale for landscape-ecological predictions and models. The results of these lazy learning processes suggest that, quite often, a combination of local and neighbourhood data sources, or explanatory variables at different scales, offers the most reliable predictions. It has been shown that local variances of image texture indicate both canopy coverage and the spatial pattern of crown size (Coops and Culvenor, 2000). Stand age and spatial structure were not among the 17 variables predicted in this study. Autocorrelation and local standard deviation of pixel lightness were valuable, e.g. for predicting habitat classes and the presence/absence of oak in the stand formula (Fig. 5B and D). Averaged over all variables, however, the indicator value of these local-pattern characteristics was below the mean (Fig. 5A). Soil characteristics were generally better predictors than slope angle and relative elevation. For example, the best prediction of presence/absence of oak was calculated predominantly using the characteristics of soil and of spatial autocorrelation: map sheet and the x and y coordinates (Fig. 5D). Spatial autocorrelation here expresses the higher reliability of comparisons of nearby pixels and the higher probability of oak presence or absence if oak was, respectively, recorded or not recorded at nearby stations during field observations. Cartesian coordinates received a high rating in the prediction of total coverage of coniferous trees, but, in contrast to habitat classes and oak, machine learning considered soil characteristics not to be relevant in the estimation of coniferous coverage using exemplars (Fig. 5C). Instead, the red and infrared channels and the grey-channel lightness from Landsat TM, and the red and blue tones of aerial photos, were selected among other features. The suggestion from these machine learning results is that observations from different soil types can successfully be compared, and exemplars from a different soil type can be used for the estimation of total coverage of coniferous trees.
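The local-pattern features discussed above (mean and standard deviation of pixel lightness in a neighbourhood) can be computed with a simple moving window. A minimal sketch; the window size and the toy image are illustrative, not the actual parameters of the study.

```python
import statistics

def local_stats(img, radius=1):
    """Local mean and standard deviation of pixel values in a
    (2*radius+1) x (2*radius+1) moving window; pixels near the edge
    use the part of the window that fits inside the image.  Such
    texture layers can serve as explanatory features alongside
    single-pixel values."""
    h, w = len(img), len(img[0])
    mean = [[0.0] * w for _ in range(h)]
    std = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            win = [img[y][x]
                   for y in range(max(i - radius, 0), min(i + radius + 1, h))
                   for x in range(max(j - radius, 0), min(j + radius + 1, w))]
            mean[i][j] = statistics.fmean(win)
            std[i][j] = statistics.pstdev(win)
    return mean, std

# Tiny illustrative "image": a bright 2x2 block on a dark background.
img = [[0.0] * 5 for _ in range(5)]
for y in (2, 3):
    for x in (2, 3):
        img[y][x] = 10.0
mean, std = local_stats(img)
print(mean[2][2])   # window on the block edge mixes bright and dark pixels
print(std[0][0])    # homogeneous corner: zero local deviation
```

A large local standard deviation marks heterogeneous texture (e.g. crown/gap pattern), while a near-zero value marks homogeneous cover.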


5.4. Exemplar weights

On most occasions, less than half of the training instances were retained as exemplars after instance filtering and weighting. Only in the best predictive set for total coverage of coniferous trees did the share of exemplars exceed half, being 290 (51%) out of 564 training instances. The number of exemplars in the best predictive sets of other multispecies variables was also relatively high: stand coverage (258), stand formula (284), stand composition incl. open share (278), dominance (256). Variables predicted by a relatively small number of exemplars were: coverage of oak (33 exemplars, 5.9% of instances), presence/absence of oak (77), and pine coverage (102). The number of exemplars per habitat class selected by the learning process varies approximately in proportion to the number of instances and to the total area of the habitat categories in the study area (Fig. 6). More exemplars were needed to represent common habitat classes such as coniferous, mixed, and deciduous forests, and mesic and wet grasslands, probably because the characteristics of these widely distributed and heterogeneous classes are highly variable.

Fig. 6. Number of training instances and exemplars selected by learning method LM13 for prediction of habitat categories.

The locations of exemplars in the multidimensional feature space cannot be depicted directly on a 2D surface. A common practice is to illustrate objects in the feature space by projecting their locations onto a two-dimensional feature surface. An example is the location of training instances and exemplars selected for the best predictive set for habitat classes, plotted relative to Landsat TM channels 2 and 5 (Fig. 7). Selected exemplars tend to be placed along the boundaries of categories. This is easily visible for agricultural land, exemplars of which form a line half-around the training observations defined as agricultural land. The learning process did not select exemplars from the side of larger values of the two dimensions because there are no observations from competing habitat classes within the preclassification unit "agricultural land and meadows". Mesic and wet grasslands are large and heterogeneous classes at various stages of land use, land abandonment, and spontaneous forestation. Soil properties mainly define the difference between mesic and wet grasslands, and the difference is not expected to be easily visible from image data. The instances of mesic grasslands are mostly located in two clusters in the part of the feature space where the ratio of reflectance values TM5/TM2 is higher. This might be attributable to the faster development of grass on dry soils in spring and the larger relative reflectance in the interval of TM channel 2, which represents visible green. The cluster of wet grassland observations having large values on both axes represents predominantly ameliorated but abandoned grasslands on peat soil; the other cluster consists mainly of observations on gleyic soils. Exemplars of mesic grassland are located mainly along the boundary with wet grasslands. The learning process did not select some outlying observations of grasslands into the set of exemplars. Extremely outlying exemplars of one class may well be erroneous and cause misclassification of other classes.

Fig. 7. Location of training instances and exemplars for prediction of habitat classes in a normalised 2D projection of the feature space (Landsat TM channels 2 and 5). Units on both axes are standard deviations.

6. Conclusions

Case-based methods are promising for processing images and other spatial data and could be among the best methods for habitat mapping, predicting the spatial distribution of species, and modelling habitat suitability, especially if a large number of different data sources is available and empirical training observations exist. Case-based methods have advantages over other methods if the response variable has a multimodal distribution relative to some explanatory variables, if the predictable classes are heterogeneous, if the training data set grows continuously, if both numerical and nominal features have to be combined, and if values of multiple response variables have to be predicted. A case-based prediction system consists of many interacting components: rules for similarity measurement, exemplars, weights of exemplars and of features, indexes or preclassification, and rules for operation. Validating these parts separately is difficult, but the system as a whole can easily be evaluated by validating its prediction results. Continuous leave-one-out cross-validation of the prediction goodness-of-fit can easily be programmed into a machine learning process. CBR does not yield concise representations of concepts that can be easily understood by humans, but CBR has advantages over rule-based reasoning (RBR) in complex situations. It is quite often impracticable or even impossible to specify multiple dependences and complex variables fully by rules, but examples of solutions can still be given. CBR encapsulates detailed knowledge in exemplars. The exemplars can be interpreted in different ways, whereas once a rule has been derived from instances, the rule must be derived anew for re-interpretation. A general limitation of all empirical modelling methods, including CBR systems, is the threat of inconspicuous over-fitting, which leads to instability of predictions and unreliability of extrapolations.
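The leave-one-out validation mentioned above is straightforward to embed in a learning loop: each case is predicted from all the others. A minimal sketch for a k-nearest-neighbour case-based classifier; the toy cases and labels are hypothetical, and the real system additionally used feature and exemplar weights.

```python
def loo_goodness(cases, labels, k=1):
    """Leave-one-out cross-validation of a k-nearest-neighbour
    case-based classifier: each case is predicted from all the other
    cases, and the share of correct predictions estimates the
    goodness-of-fit of the prediction system as a whole."""
    correct = 0
    for i in range(len(cases)):
        neighbours = sorted(
            (sum((a - b) ** 2 for a, b in zip(cases[i], cases[j])), labels[j])
            for j in range(len(cases)) if j != i
        )
        votes = [label for _, label in neighbours[:k]]
        prediction = max(set(votes), key=votes.count)
        correct += prediction == labels[i]
    return correct / len(cases)

# Hypothetical two-feature cases with two habitat labels.
cases = [(0.1, 0.2), (0.2, 0.1), (0.9, 1.0), (1.0, 0.9)]
labels = ["grassland", "grassland", "forest", "forest"]
print(loo_goodness(cases, labels))
```

Because the held-out case never contributes to its own prediction, the resulting score can be recomputed after every change to the feature or exemplar weights during learning.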


Two machine learning methods produced the exemplar and feature weights for the best predictions. One of these (LM13) consists of: (1) stepwise inclusion of features with proportionally decreasing weights, (2) additional learning of feature weights during 30 iterations, and (3) exemplar weighting. LM13 gave the largest number of best results and outperformed the other one-run learning methods. The other efficient learning method (LM1) uses gradually increasing random samples of training instances. The results of sample-based learning differ between runs. The best fit for seven response variables out of 17 was achieved after adding exemplar weighting to the best set of cases and feature weights from 25 repetitions of the LM1 process. Case-based predictions are not black-box decisions. The weights of features can be used as estimators of the indicator value of explanatory variables. The weights of cases may be interpreted as estimates of the reliability or typicality of particular observations. The weights of features in the best predicting sets of this study point to the overall high value of land-cover categories from the base map, to the importance of soil data for some variables, and to the lower usefulness of greyscale orthophotos compared to colour photos and to the visible and near-infrared channels of Landsat TM. Characteristics of both the same pixel and its neighbourhood were used in the best predictive feature sets. Features describing spatial autocorrelation, such as location coordinates and map sheet code, should also not be ignored.
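The stepwise feature inclusion in LM13 can be illustrated, in outline only, by generic greedy forward selection driven by a goodness-of-fit score (in the study, leave-one-out fit). The proportionally decreasing weights and iterative re-weighting of the actual method are not reproduced here, and the toy score function is purely illustrative.

```python
def forward_select(features, score):
    """Greedy forward feature selection: repeatedly add features that
    improve a cross-validated score; stop when no remaining feature
    helps.  `score` maps a feature set to a goodness-of-fit estimate
    (e.g. leave-one-out accuracy of the case-based predictor)."""
    selected, best = set(), score(set())
    improved = True
    while improved:
        improved = False
        for f in features - selected:
            s = score(selected | {f})
            if s > best:
                best, selected, improved = s, selected | {f}, True
    return selected, best

# Toy score: pretend the hypothetical features "soil" and "tm5" each
# add a fixed gain and "grey" adds nothing (purely illustrative).
gains = {"soil": 0.2, "tm5": 0.15, "grey": 0.0}
result = forward_select(set(gains), lambda chosen: sum(gains[f] for f in chosen))
print(sorted(result[0]), round(result[1], 2))
```

With a leave-one-out score plugged in, each candidate feature is admitted only if it measurably improves the cross-validated fit, which is one safeguard against retaining irrelevant features.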

Acknowledgements The author is grateful to Michael Bock from the German Aerospace Centre and to Andrej Kobler from the Slovenian Forestry Institute for inducing communications, to students Kaupo Kohv and Jaanus Remm for careful field work, to two anonymous referees, and to Ilmar Part for revising the manuscript. This study was supported by European Union 5th Framework project SPIN (Spatial Indicators for European Nature Conservation) and grant 5261 of Estonian Science Foundation. Soil data, orthophotos, and base map are used in accordance with the Estonian Land Board licences 46, 107, and 334.


References

Aaviksoo, K., Paal, J., Dišlis, T., 2000. Mapping of wetland habitat diversity using satellite data and GIS: an example from the Alam-Pedja Nature Reserve, Estonia. Proc. Estonian Acad. Sci. Biol. Ecol. 49 (2), 177–193. Aha, D.W., 1998a. The omnipresence of case-based reasoning in science and application. Knowledge-Based Systems 11, 261–273. Aha, D.W., 1998b. Feature weighting for lazy learning algorithms. In: Liu, H., Motoda, H. (Eds.), Feature Extraction, Construction and Selection: A Data Mining Perspective. Kluwer Academic Publisher, Norwell, MA, pp. 13–32. Aha, D.W., Bankert, R.L., 1994. Feature selection for case-based classification of cloud types: an empirical comparison. In: Aha, D.W. (Ed.), Case-Based Reasoning: Papers from the 1994 Workshop. AAAI Press, Menlo Park, CA, pp. 106–112. Aha, D.W., Breslow, L.A., 1997. Refining conversational case libraries. In: Leake, D.B., Plaza, E. (Eds.), Case-Based Reasoning Research and Development. Proceedings of the ICCBR-97, LNAI. Springer, Berlin, pp. 267–278. Aha, D.W., Goldstone, R.L., 1992. Concept learning and flexible weighting. In: Proceedings of the 14th Annual Conference of the Cognitive Science Society, Bloomington, IN. Lawrence Erlbaum, London, pp. 534–539. Aha, D.W., Kibler, D., Albert, M.K., 1991. Instance-based learning algorithms. Machine Learn. 6, 37–66. Aha, D.W., Salzberg, S.L., 1993. Learning to catch: applying nearest neighbor algorithms to dynamic control tasks. In: Proceedings of the Fourth International Workshop on Artificial Intelligence and Statistics, Ft. Lauderdale, FL, pp. 363–368. Althoff, K., Bergmann, R., Wess, S., Manago, M., Auriol, E., Larichev, O.I., Bolotov, A., Zhuravlev, Y.I., Gurov, S.I., 1998. Case-based reasoning for medical decision support tasks: the Inreca approach. Artif. Intell. Med. 12, 25–41. Aspinall, R.J., 1992. An inductive modelling procedure based on Bayes’ theorem for analysis of pattern in spatial data. Int. J. Geogr. Inform. Systems 6 (2), 105–121.
Atkeson, C., Moore, A., Schaal, S., 1997. Locally weighted learning. Artif. Intell. Rev. 11 (1–5), 11–73. Atkinson, P.M., Lewis, P., 2000. Geostatistical classification for remote sensing: an introduction. Comput. Geosci. 26, 361– 371. Austin, M.P., Nicholls, A.O., Margules, C.R., 1990. Measurement of the realized quantitative niche: environmental niche of five Eucalyptus species. Ecol. Monogr. 60 (2), 161–177. Bareiss, R., 1989. Exemplar-Based Knowledge Acquisition: A Unified Approach to Concept Representation, Classification, and Learning. Perspectives in Artificial Intelligence. vol. 2. Academic Press, San Diego, London. 169 pp. Birks, H.J.B., 1993. Quaternary palaeoecology and vegetation science—current contributions and possible future developments. Rev. Paleaobot. Palynol. 79 (1–2), 153–177. Blum, A.L., Langley, P., 1997. Selection of relevant features and examples in machine learning. Artif. Intell. 97, 245–271.



Bonn, A., Schröder, B., 2001. Habitat models and their transfer for single and multi species groups: a case study of carabids in an alluvial forest. Ecography 24, 483–496. Bradshaw, G., 1987. Learning about speech sounds: the NEXUS project. In: Proceedings of the Fourth International Workshop on Machine Learning. Morgan Kaufmann, Irvine, CA, pp. 1– 11. Breininger, D.R., Provancha, M.J., Smith, R.B., 1991. Mapping Florida Schrub Jay habitat for purposes of land management. Photogramm. Eng. Remote Sens. 57 (11), 1467–1474. Brito, J.C., Crespo, E.G., Paulo, O.S., 1999. Modelling wildlife distributions: logistic multiple regression vs overlap analysis. Ecography 22, 251–260. Buckland, S.T., Elston, D.A., 1993. Empirical models for the spatial distribution of wildlife. J. Appl. Ecol. 30, 478–495. Carbonell, J.G., 1986. Derivational analogy: a theory of reconstructive problem solving and expertise acquisition. In: Michalski, R.S., Carbonell, J.G., Mitchell, T.M. (Eds.), Machine Learning. An Artificial Intelligence Approach. Volume II. Morgan Kaufmann. Los Altos, CA, pp. 371–392. Carbonell, J.G., 1990. Derivational analogy: a theory of reconstructive problem solving and expertise acquisition. In: Shavlik, J.W., Dietterich, T.G. (Eds.), Readings in Machine Learning. Morgan Kaufmann, Palo Alto, CA, pp. 247–288. Carl, M., 1997. Case composition needs adaptation knowledge: a view on EBMT. In: Wettschereck, D., Aha D. (Eds.), ECML-97 MLNet Workshop Notes Case-Based Learning: Beyond Classification of Feature Vectors. NCARAI Technical Note AIC-97-005, pp. 17–24. Clark, J.D., Dunn, J.E., Smith, K.G., 1993. A multivariate model of female black bear habitat use for a geographic information system. J. Wildlife Manage. 57 (3), 519–526. Coops, N., Culvenor, D., 2000. Utilizing local variance of simulated high spatial resolution imagery to predict spatial pattern of forest stands. Remote Sens. Environ. 71, 248–260. Cost, S., Salzberg, S., 1993. 
A weighted nearest neighbour algorithm for learning with symbolic features. Machine Learn. 10, 57–78. Daengdej, J., Lukose, D., Murison, R., 1999. Using statistical models and case-based reasoning in claims prediction: experience from a real-world problem. Knowledge-Based Systems 12, 239–245. Dash, M., Liu, H., 1997. Feature selection for classification. Intell. Data Anal. 1, 131–156. Davies, C.E., Moss, D., 2002. EUNIS habitat classification 2001 work programme final report. European Environmental Agency, European Topic Centre on Nature Protection and Biodiversity. De’ath, G., 2002. Multivariate regression trees: a new technique for modeling species-environment relationships. Ecology 83 (4), 1105–1115. De’ath, G., Fabricius, K., 2000. Classification and regression trees: a powerful yet simple technique for ecological data analysis. Ecology 81 (11), 3178–3192. Debeljak, M., Džeroski, S., Adamič, M., 1999. Interactions among the red deer (Cervus elaphus L.) population, meteorological parameters and new growth of the natural regenerated forest in Snežnik, Slovenia. Ecol. Model. 121, 51–61.

Debeljak, M., Džeroski, S., Jerina, K., Kobler, A., Adamič, M., 2001. Habitat suitability modelling for red deer (Cervus elaphus L.) in South-central Slovenia with classification trees. Ecol. Model. 138, 321–330. Duncan, P., 1983. Determinants of the use of habitat by horses in a Mediterranean wetland. J. Anim. Ecol. 52, 93–109. Ellingsen, B.K., 1997. Distributed representations for analogical mapping. In: Wettschereck, D., Aha, D. (Eds.), ECML-97 MLNet Workshop Notes Case-Based Learning: Beyond Classification of Feature Vectors. NCARAI Technical Note AIC-97-005, pp. 25–32. Falkenhainer, B., Forbus, K., Gentner, D., 1989. The structure-mapping engine: algorithm and examples. Artif. Intell. 41, 1–63. Ficet-Cauchard, V., Porquet, C., Revenu, M., 1999. CBR for the reuse of image processing knowledge: a recursive retrieval/adaptation strategy. In: Althoff, K.-D., Bergmann, R., Branting, K.L. (Eds.), Case-Based Reasoning Research and Development. LNAI 1650. Springer, Berlin, pp. 438–453. Finnie, G.R., Wittig, G.E., Desharnais, J.-M., 1997. Estimating software development effort with case-based reasoning. In: Leake, D.B., Plaza, E. (Eds.), Case-Based Reasoning Research and Development. Second International Conference on Case-Based Reasoning, ICCBR-97, LNAI 1266. Springer, Berlin, pp. 13–22. Flower, R.J., Juggins, S., Battarbee, R.W., 1997. Matching diatom assemblages in lake sediment cores and modern surface sediment samples: the implications for lake conservation and restoration with special reference to acidified systems. Hydrobiologia 344, 27–40. Franklin, J., 1995. Predictive vegetation mapping: geographic modelling of biospatial patterns in relation to environmental gradients. Progr. Phys. Geogr. 19, 474–499. Franklin, J., 1998. Predicting the distribution of shrub species in southern California from climate and terrain-derived variables. Journal of Vegetation Science 9, 733–748. Frescino, T.S., Edwards Jr., T.C., Moisen, G.G., 2001.
Modelling spatially explicit forest structural attributes using generalized additive models. Journal of Vegetation Science 12, 15–26. Frize, M., Walker, R., 2000. Clinical decision-support systems for intensive care units using case-based reasoning, Medical Engineering. Medical Engineering & Physics 22, 671–677. Gentner, D., 1983. Structure mapping: a theoretical framework for analogy. Cognitive Psychology 25 (4), 431–467. Guisan, A., Zimmermann, N.E., 2000. Predictive habitat distribution models in ecology. Ecological Modelling 135 (2– 3), 147–186. Güvenir, H.A., Cicekli, I., 1998. Learning translation templates from examples. Information Systems 23 (6), 353–363. Hansen, M.J., Franklin, S.E., Woudsma, C.G., Peterson, M., 2001. Caribou habitat mapping and fragmentation analysis using Landsat MSS, TM, and GIS data in the North Columbia Mountains, British Columbia, Canada. Remote Sens. Environ. 77, 50–65. Heider, R., 1996. Troubleshooting CFM 56-3 engines for the Boeing 737 using CBR and data-mining. In: Smith, I., Faltings, B. (Eds.), Advances in Case-based Reasoning. Proceedings of the Third European Workshop on Advances in Case-Based

Reasoning, Lausanne, Switzerland, November 14–16, 1996. LNAI 1168. Springer, pp. 512–518. Hill, M.O., 1991. Patterns of species distribution in Britain elucidated by canonical correspondence analysis. J. Biogeogr. 18, 247–255. Hirzel, A.H., Hausser, D., Chessel, D., Perrin, N., 2002. Ecological-niche factor analysis: how to compute habitat-suitability maps without absence data? Ecology 83 (7), 2027–2036. Howe, N., Cardie, C., 1997. Examining locally varying weights for nearest neighbor algorithms. In: Leake, D.B., Plaza, E. (Eds.), Case-Based Reasoning Research and Development. Proc. ICCBR-97, LNAI. Springer, Berlin, pp. 455–466. Hua, K., Smith, I., Faltings, B., 1994. Integrated case-based building design. In: Wess, S., Althoff, K.-D., Richter, M.M. (Eds.), Topics in Case-Based Reasoning. Proceedings of the First European Workshop, EWCBR-93, Kaiserslautern, Germany, November 1–5, 1993. Springer, Berlin, pp. 438–445. Hüllermeier, E., 2001. Similarity-based inference as evidential reasoning. In: Horn, W. (Ed.), Proceedings of ECAI-2000, 14th European Conference on Artificial Intelligence. IOS Press, Berlin, Germany, pp. 50–54. Jensen, J.R., Narumalani, S., Weatherbee, O., Morris, K.S., 1992. Predictive modelling of Cattail and Waterlily distribution in a South Carolina Reservoir using GIS. Photogramm. Eng. Remote Sens. 58 (11), 1561–1568. Johnson, C., 2002. Software tools to support incidence reporting in safety-critical systems. Safety Sci. 40, 765–780. Jurafsky, D., Martin, J.H., 2000. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall Series in Artificial Intelligence XXVI, Upper Saddle River, NJ, 934 pp. Jurisica, I., Mylopoulos, J., Glasgow, J., Shapiro, H., Casper, R.F., 1998. Case-based reasoning in IVF: prediction and knowledge mining. Artif. Intell. Med. 12, 1–24. Kiang, M.Y., 2003.
A comparative assessment of classification methods. Decision Support Systems 35, 441–454. Kibler, D., Aha, D.W., 1987. Learning representative exemplars of concepts: an initial case study. In: Proceedings of the Fourth International Workshop on Machine Learning, UC-Irvine, June 1987, pp. 24–29. Kobler, A., Adamic, M., 2000. Identifying brown bear habitat by combined GIS and machine learning method. Ecol. Model. 135, 291–300. Kwon, O.-W., Lee, J.-H., 2003. Text categorization based on k-nearest neighbor approach for Web site classification. Inform. Process. Manage. 39, 25–44. Lam, W., Keung, C.-K., Ling, C.X., 2002. Learning good prototypes for classification using filtering and abstraction of instances. Pattern Recogn. 35, 1491–1506. Langley, P., Iba, W., 1993. Average-case analysis of a nearest neighbour algorithm. In: Proceedings IJCAI-93, Chambery, France, pp. 889–894. Lenz, M., Burkhard, H.-D., 1997. CBR for document retrieval: the FALLQ project. In: Leake, D.B., Plaza, E. (Eds.), Case-Based Reasoning Research and Development. Proceedings of the ICCBR-97, LNAI. Springer, Berlin, pp. 84–93.


Liao, S., 2000. Case-based decision support system: architecture for simulating military command and control. Eur. J. Operat. Res. 123, 558–567. Ling, C.X., Wang, H., 1997. Towards optimal weights setting for the 1-nearest neighbor learning algorithm. Artif. Intell. Rev. 11, 255–272. Luoto, M., Kuussaari, M., Rita, H., Salminen, J., von Bondsdorff, T., 2001. Determinants of distribution and abundance in the clouded apollo butterfly: a landscape ecological approach. Ecography 24, 601–617. Mair, C., Kadoda, G., Lefley, M., Phalp, K., Schofield, C., Shepperd, M., Webster, S., 2000. An investigation of machine learning based prediction systems. J. Systems Software 53, 23– 29. Maron, O., Moore, A.W., 1997. The racing algorithm: model selection for lazy learners. Artif. Intell. Rev. 11, 192–225. Meiner, A. (Ed.), 1999. Eesti maakate. CORINE Land Cover projekti täitmine Eestis. Land Cover of Estonia. Implementation of CORINE Land Cover project in Estonia. Tallinn, 133 pp. Micarelli, A., Neri, A., Sansonetti, G., 2000. A case-based approach to image recognition. In: Blanzieri, E., Portinale, L. (Eds.), Advances in Case-Based Reasoning, LNAI 1898. Springer, Berlin, pp. 443–454. Michalski, R.S., 1986. Understanding the nature of learning: Issues and research directions. In: Michalski, R.S., Carbonell, J.G., Mitchell, T.M. (Eds.), Machine Learning. An Artificial Intelligence Approach. vol. II. Morgan Kaufmann Publishers, Los Altos, CA, pp. 3–25. Milne, B.T., Johnston, K.M., Forman, R.T.T., 1989. Scaledependent proximity of wildlife habitat in a spatially-neutral Bayesian model. Landsc. Ecol. 2 (2), 101–110. Mitchell, T., 1997. Machine Learning. McGraw-Hill, Boston, MA, 414 pp. Moeur, M., Stage, A.R., 1995. Most similar neighbour: an improved sampling inference procedure for natural resource planning. Forest Sci. 41, 337–359. Moore, D.M., Lees, B.G., Davey, S.M., 1991. 
A new method for predicting vegetation distributions using decision tree analysis in a geographic information system. Environ. Manage. 15 (1), 59–71. Muinonen, E., Maltamo, M., Hyppänen, H., Vainikainen, V., 2001. Forest stand characteristics estimation using a most similar neighbor approach and image spatial structure information. Remote Sens. Environ. 78, 223–228. Münier, B., Nygaard, B., Ejrnæs, R., Bruun, H.G., 2001. A biotope landscape model for prediction of semi-natural vegetation in Denmark. Ecol. Model. 139, 221–233. Navinchandra, D., 1989. Case-based reasoning in CYCLOPS, design problem solver. In: Proceedings of the DARPA Workshop on Case-Based Reasoning, pp. 286–301. Neave, H.M., Norton, T.W., 1998. Biological inventory for conservation evaluation. IV. Composition, distribution and spatial prediction of vegetation assemblages in southern Australia. Forest Ecol. Manage. 106, 259–281. Okamoto, S., Yugami, N., 1997. Theoretical analysis of case retrieval method based on neighbourhood of a new problem. In: Leake, D.B., Plaza, E. (Eds.), Case-Based Reasoning



Research and Development. Proceedings of the ICCBR-97, LNAI. Springer, Berlin, pp. 349–358. Osborne, J.M., Brearley, D.R., 2000. Completion criteria case studies considering bond relinquishment and mine decommissioning: Western Australia. Int. J. Surface Mining, Reclam. Environ. 14 (3), 193–204. Pazzani, M., 1991. A computational theory of learning causal relationships. Cognitive Sci. 15, 401–424. Perner, P., 1998. Using CBR learning for the low-level and high-level unit of an image interpretation system. ICAPR’98, Plymouth, Peer Reviewed Conference. In: Singh, S. (Ed.), Advances in Pattern Recognition. Springer, Berlin, pp. 45–54. Perner, P., 1999. An architecture for a CBR image segmentation system. Eng. Appl. Artif. Intell. 12, 749–759. Perner, P., 2001. Why case-based reasoning is attractive for image interpretation. In: Aha, D., Watson, I. (Eds.), Case-Based Reasoning Research and Development, LNAI 2080. Springer, Berlin, pp. 27–44. Puttock, G.D., Shakotko, P., Rasaputra, J.G., 1996. An empirical model for moose, Alces alces, in Algonquin Park, Ontario. Forest Ecol. Manage. 81, 169–178. Ramirez, C., Cooley, R., 1997. A theory of the acquisition of episodic memory. In: Wettschereck, D. Aha, D. (Eds.), ECML97 MLNet Workshop Notes Case-Based Learning: Beyond Classification of Feature Vectors. NCARAI Technical Note AIC-97-005, pp. 25–32. Redmond, M., Phillips, S., 1997. Encouraging self-explanation through case-based tutoring: a case study. In: Leake, D.B., Plaza, E. (Eds.), Case-Based Reasoning Research and Development. Proceedings of the ICCBR-97, LNAI, Springer, Berlin, pp. 132–143. Remm, K., 2002. Otepää looduspargi taimkatte kasvukohatüüpide kaart. Map of vegetation site types of Otepää Nature Park (in Estonian). In: Frey, T. (toim.) Eesti süsinikubilansi ökoloogiast ja ökonoomikast. Eesti XIII ökoloogiapäev 03. mail 2002 Tartus EPMÜ aulas, Tartu, pp. 62–76. Reutter, B.A., Helfer, V., Hirzel, A.H., Vogel, P., 2003. 
Modelling habitat-suitability using museum collections: an example with three sympatric Apodemus species from the Alps. J. Biogeogr. 30, 581–590. Ross, S., Fang, L., Hipel, K.W., 2002. A case-based reasoning system for conflict resolution: design and implementation. Eng. Appl. Artif. Intell. 15, 369–383. Sachot, S., 2002. Viability and management of an endangered capercaillie (Tetrao urogallus) metapopulation, Thèse de Doctorat. Institute of Ecology, Lausanne, 131 pp. Sagris, V., Krusberg, P., 1997. Estonian base map project and its applications. In: Hodgson, S., Rumor, M., Harts, J.J. Geographical Information ’97. Joint European Conference and Exhibition on Geographical Information. Proceedings, vol. 2. IOS Press, Amsterdam, pp. 937–945. Saveraid, E.H., Debinski, D.M., Kindscher, K., Jakubauskas, M.E., 2001. A comparison of satellite data and landscape variables in predicting bird species occurrences in the Greater Yellowstone ecosystem. Landsc. Ecol. 16, 71–83. Sazanova, L., Osipov, G., Godovnikov, M., 1999. Intelligent system for fish stock prediction and allowable catch evaluation. Environ. Model. Software 14, 391–399.

Schank, R.C., 1982. Dynamic Memory: A Theory of Learning in Computers and People. Cambridge University Press, Cambridge, New York, Melbourne, 234 pp. Schank, R., Abelson, R., 1977. Scripts, Plans, Goals and Understanding. Erlbaum, Hillsdale, NJ, 248 pp. Seitz, A., Uhrmacher, A.M., Damm, D., 1999. Case-based prediction in experimental medical studies. Artif. Intell. Med. 15, 255–273. Stanfill, C., Waltz, D., 1986. Toward memory-based reasoning. Commun. ACM 29 (12), 1213–1228. Stankovski, V., Debeljak, M., Bratko, I., Adamič, M., 1998. Modelling the population dynamics of red deer (Cervus elaphus L.) with regard to forest development. Ecol. Model. 108, 145–153. Suh, M.S., Jhee, W.C., Ko, Y.K., Lee, A., 1998. A case-based expert system approach for quality design. Expert Syst. Appl. 15, 181–190. Tomppo, E., Nilsson, M., Rosengren, M., Aalto, P., Kennedy, P., 2002. Simultaneous use of Landsat-TM and IRS-1C WiFS data in estimating large area tree stem volume and aboveground biomass. Remote Sens. Environ. 82, 156–171. Verdenius, F., Broeze, J., 1999. Generalised and instance-specific modelling for biological systems. Environ. Model. Software 14, 339–348. Vogelmann, J.E., Sohl, T., Howard, S.M., 1998. Regional characterization of land cover using multiple sources of data. Photogramm. Eng. Remote Sens. 64 (1), 45–57. Watson, I., Perera, S., 1997. The evaluation of a hierarchical case representation using context guided retrieval. In: Leake, D.B., Plaza, E. (Eds.), Case-Based Reasoning Research and Development. Proceedings of the ICCBR-97, LNAI. Springer, Berlin, pp. 255–266. Wettschereck, D., Aha, D.W., 1995. Weighting features. In: Veloso, M.M., Aamodt, A. (Eds.), Case-Based Reasoning Research and Development, Proceedings of the First International Conference, ICCBR-95, LNCS 1010, Sesimbra, Portugal, 23–26 October 1995. Springer, pp. 347–358. Wettschereck, D., Dietterich, T.G., 1995.
An experimental comparison of the nearest-neighbor and nearest-hyperrectangle algorithms. Machine Learn. 19, 5–27. Wettschereck, D., Aha, D.W., Mohri, T., 1997. A review and empirical evaluation of feature weighting methods for a class of lazy learning algorithms. Artif. Intell. Rev. 11 (1–5), 273–314. Wilds, S., Boetsch, J., van Manen, F.T., Clark, J.D., White, P.S., 2000. Modeling the distributions of species and communities in Great Smoky Mountains National Park. Comput. Electron. Agric. 27, 389–392. Wilson, D.L., 1972. Asymptotic properties of nearest neighbor rules using edited data. IEEE Trans. Syst. Man Cybernetics 2 (3), 408–421. Wilson, D.R., Martinez, T.R., 2000. Reduction techniques for instance-based learning algorithms. Machine Learn. 38, 257– 286. Winston, P.H., 1986. Learning by augmented rules and accumulating censors. In: Michalski, R.S., Carbonell, J.G., Mitchell, T.M. (Eds.), Machine Learning. An Artificial Intelligence Approach, vol. II. Morgan Kaufmann, Los Altos, CA, pp. 45–61.

Wu, X.B., Smeins, F.E., 2000. Multiple-scale habitat modeling approach for rare plant conservation. Landsc. Urban Plan. 51, 11–28. Yeh, A.G.O., Shi, X., 2001. Case-based reasoning (CBR) in development control. Int. J. Appl. Earth Obs. Geoinform. 3 (3), 238–251.


Zhang, J., Yim, Y.S., Yang, J., 1997. Intelligent selection of instances for prediction functions in lazy learning algorithms. Artif. Intell. Rev. 11 (1–5), 175–192. Zhang, L., Coenen, F., Leng, P., 2002. Formalising optimal feature weight setting in case based diagnosis as linear programming problems. Knowledge-Based Systems 15, 391–398.