Ecological Modelling 268 (2013) 55–63
Combining simulated expert knowledge with Neural Networks to produce Ecological Niche Models for Latimeria chalumnae
Gianpaolo Coro a,∗, Pasquale Pagano a, Anton Ellenbroek b
a Istituto di Scienza e Tecnologie dell’Informazione “Alessandro Faedo” – CNR, Pisa, Italy
b Food and Agriculture Organization of the United Nations (FAO), Italy
Article info
Article history: Received 23 March 2013; received in revised form 31 July 2013; accepted 8 August 2013; available online 11 September 2013
Keywords: Ecological Niche Modelling; AquaMaps; Neural Networks; Latimeria chalumnae
Abstract
The order Coelacanthiformes, once thought extinct, is much studied mainly because it contains species that share characteristics with lungfishes and tetrapods. Living specimens became known to science only during the last century, and observations are so rare that the species are considered to be critically endangered. Observations include Latimeria chalumnae in deep waters off the coast of south-eastern Africa, while Latimeria menadoensis is known from similar habitats in Indonesian waters. Because of the interest around these enigmatic species, Ecological Niche Modelling techniques have been applied to estimate their distribution. The underlying assumption is that the environmental characteristics of the observation points are representative for the species. In this article we evaluate the difference in output between the niche distributions produced by two expert systems and by two models based on Artificial Neural Networks. We evaluate the predictive behaviour of these models by focusing on L. chalumnae, as more observations are available for this species than for L. menadoensis. Finally, we assess the reliability of the maps by numerically evaluating the representativeness of the environmental characteristics at the observation locations with respect to an area where the models show significant differences. This approach differs from previous ones because one of the expert systems is used to infer pseudo-absence points, which are subsequently used to train a Neural Network. One of the models based on this Neural Network is used to estimate the potential distribution and to produce a more extended map. The method promises to be applicable to other species with few observations, and makes it possible to exploit the power of presence/absence-based techniques. © 2013 Elsevier B.V. All rights reserved.
1. Introduction
Species belonging to the order Coelacanthiformes, once considered extinct, were discovered during the last century. The order gained scientific interest because these species represent a potential link between lungfishes and tetrapods. The actual distribution (Pearson, 2012) of Latimeria chalumnae, that is, the area with a high probability of occurrence of the species, is located around the south-eastern coast of Africa (Smith, 1939), while Latimeria menadoensis lives in Indonesian waters (Erdmann et al., 1998). Several influential studies have been published on the Coelacanthiformes, in particular on L. chalumnae, the most frequently encountered species in the order. Fricke (1997) analysed issues in the conservation of L. chalumnae and demonstrated that human activity can strongly affect its chance of survival. Other studies have investigated its biological characteristics, and performed
∗ Corresponding author (G. Coro). Tel.: +39 050 315 2978; fax: +39 050 621 3464.
0304-3800/$ – see front matter © 2013 Elsevier B.V. All rights reserved. http://dx.doi.org/10.1016/j.ecolmodel.2013.08.005
phylogenetic comparisons with other craniata by reconstructing its cranial nerves (Northcutt and Bemis, 1993). We chose to study whether the development of conservation strategies for this species can benefit from spatial modelling and dataset analysis. We took the paper by Owens et al. (2012) as the reference study for our experiments on L. chalumnae. The authors used Ecological Niche Modelling (ENM) techniques to estimate the extent of the potential distribution of the coelacanths, that is, the areas where abiotic conditions fall within the fundamental niche of the species. They used only environmental data associated with presence information and trained two models based on the GARP (Stockwell, 1999) and MaxEnt (Berger et al., 1996) algorithms, using 13 environmental parameters ranging from benthic temperature to oxygen, chlorophyll, phosphate, silicate and nitrate concentrations. The projection of these models onto the oceans produced probability distributions that were visualized on a geographical map. Relying on a qualitative analysis of the results, they argued that the distribution of the coelacanths could extend well beyond their known range. To validate the results, we focused on L. chalumnae and developed a model based on an authoritative occurrence data source, FishBase (Froese and Pauly, 2000). Our method uses
G. Coro et al. / Ecological Modelling 268 (2013) 55–63
Fig. 1. Representation of our method. The Neural Network is trained on the basis of presence data from FishBase and absence data simulated through the AquaMaps Native distribution.
presence information from the FishBase database along with absence information produced by niche knowledge from scientists. In particular, an expert system is used to simulate scientists’ knowledge and to generate pseudo-absence points. Such information is then used to train an automatic model relying on environmental parameters associated with both presence and pseudo-absence information. Fig. 1 depicts an overall representation of the process. We show that this approach results in a model with higher abstraction capabilities: it is able to identify areas that lie beyond the training range but where the species is known to live. We take the maps in Owens et al. (2012) as a qualitative reference for the results. The final part of the experiment evaluates the representativeness of the training set with respect to some projection areas. Representativeness indicates the reliability of a model and its expected performance. The novelty of our approach is that it can be used when only a few species observations are available. It makes it possible to exploit the power of presence/absence-based techniques, which, under certain conditions, can produce more accurate distribution maps than presence-only approaches. The paper is organized as follows. Section 2 contains an overview of Ecological Niche Modelling approaches. In Section 2.1, we discuss the controversial choice between presence-only and presence/absence approaches. In Section 2.2, we introduce the AquaMaps and Artificial Neural Network models we used in the experiments, with a first clarification of how these models address the discussion in the preceding section. In Section 3, we explain the motivations and rationale behind our approach. In Sections 3.1 and 3.2, we describe the method used to generate pseudo-absence data from AquaMaps and to train a Neural Network with this information. Section 3.3 discusses the possible consequences of our choice to use few pseudo-absences.
Section 3.4 outlines the method we adopted to evaluate the representativeness of the training set with respect to some projection areas. Section 4 reports the numerical comparisons between the maps generated for L. chalumnae along with their performance. It also reports on the numerical estimation of the reliability of the combined model. Finally, Section 5 contains a discussion of the results.
2. Overview
Ecological Niche Modelling is a complex and iterative process (Elith and Leathwick, 2009) including (i) identification of relevant data, (ii) modelling, and (iii) projection of predictions onto a geographic space. The first step includes the identification of the environmental features that relate to species preferences. The modelling techniques are usually based on occurrence records (presence points), i.e. places where the species has been observed in its habitat. Some approaches also need absence points, i.e. locations where the environment is considered unsuitable for the species (Guisan and Zimmermann, 2000). In many cases, absence points must be simulated (pseudo-absence points), because reliable data are rare, as is the case with L. chalumnae. Models need representative occurrence data and independent and complete environmental parameters in order to work properly. These are expected to provide robustness and reliability to the models (Kamino et al., 2012; Elith and Leathwick, 2009). The choice of a suitable modelling technique for a specific scenario is not trivial. There is no general pattern to follow when designing an ENM experiment: each species can be very specific in terms of habitat and presence/absence information. Elith and Leathwick (2009) report several applications and indicate possible directions for producing robust models. These include (i) improvement of methods to model presence-only data, (ii) accounting for biotic interactions and (iii) assessing model uncertainty. Similar advice comes from the BAM diagram for Biotic Interactions
described in Peterson et al. (2011). Other issues involve the choice of the kind of modelling technique to apply: a model could try to explicitly capture the preferences of a species and its physiological limits and tolerances (mechanistic approaches) (Chuine and Beaubien, 2008). Otherwise it could automatically extract the correlations between the environmental feature vectors and the species presence (correlative approaches) (Pearson, 2012). Several tools allow scientists to produce maps by applying Niche Modelling algorithms (de Souza Muñoz et al., 2011; Coro, 2011). Pearson (2012) presents a large collection of techniques along with the kinds of scenarios they should be applied to. The next section analyses the differences between approaches based purely on presence points and those that also use absence points.
2.1. Presence-only vs. presence/absence models
The decision to use presence/absence models instead of presence-only models is controversial. Several studies have focused on the different behaviour of Generalised Linear Models (GLMs), which use presence/absence data, and Ecological Niche Factor Analysis (ENFA), which uses presence-only data. Brotons et al. (2004) supported the idea that GLM predictions are more accurate than those obtained with ENFA. They compared the models in bird habitat suitability prediction. The difference was evident when the species were using available habitats proportionally to their suitability. They advised that absence data had to be reliable in order to be useful for enhancing model calibration. They also highlighted that models for wide-ranging and tolerant species were more sensitive to absence data, suggesting that presence/absence methods may be particularly important for predicting distributions of these types of species. Hirzel et al. (2001) investigated the behaviour of the same models by using a virtual species, simulating different distribution scenarios (spreading, equilibrium and overabundant).
They reported that GLMs were badly affected in the case of spreading species, but produced better results than ENFA when the species was overabundant. At equilibrium both methods produced equivalent results. The authors discussed the importance of absence data quality, which strongly affected GLM performance. When the absences were due to historical causes (as in the ‘spreading’ scenario), GLM predictive performance decreased. On the other hand, GLMs were little affected by data quantity. In Hirzel et al. (2002a,b), the authors argued for the use of presence-only models because absence data are often unreliable. They investigated the robustness of ENFA with respect to the quality of the environmental parameters involved. Spurious correlations strongly influenced performance, and correlation analysis was necessary to reduce their impact. Ferrier and Watson (2007) used Generalized Additive Models (GAMs) alternately with presence-only and presence/absence information, with an application to plants. GAMs using presence-only data gave weaker predictions than those using presence/absence data. Zaniewski et al. (2002) used GAMs on plants to analyse the effects of pseudo-absences with respect to real absences. They discussed the proper weighting of the environmental variables when using ENFA. In the cases where presence/absence data were unreliable, presence-only GAMs and ENFA showed better performance in predicting species spatial distributions. Thus, data quality is a key factor when choosing the appropriate model. As stated in Brotons et al. (2004) and in Mateo et al. (2010), the assumption that absence indicates areas where a species is not present because of a negative species-environment relationship is not necessarily valid. Many non-environmental factors can influence the absence of a species in a certain environment.
Incorporation of this type of absence data in statistical modelling strategies can introduce too many unconfirmed assumptions and can lead to suboptimal models. If absences are indeed related to low habitat suitability for the species, the information brought
by them can improve the performance of methods relying on both presence and absence data (Brotons et al., 2004).
2.2. AquaMaps and Artificial Neural Networks
Ready et al. (2010) report a comparison among the most widely used correlative approaches applied to some marine species of commercial and scientific interest. Among these, the AquaMaps algorithms (Kaschner et al., 2006, 2008) are presence-only species models that allow for the incorporation of expert knowledge about the species habitat. Two algorithms are available, named AquaMaps Suitable and AquaMaps Native. The former addresses potential distribution modelling, while the latter focuses on actual distribution modelling. The AquaMaps distributions are generated using information about species preferences for environmental properties such as depth, salinity, temperature, primary production, distance from land, and sea ice concentration. Maps are produced at 0.5° resolution. The expert knowledge is used in modelling the habitat parameters and the species preferences. The values of the environmental features are manually edited before applying the model. After that, a trapezoidal function (envelope) is traced for each environmental parameter of the species. This function represents the ‘preferred’ values for that parameter and can be automatically produced by processing the value ranges associated with the presence points. In particular, the trapezoid is traced on 4 values called minimum, preferred minimum, preferred maximum and maximum. These values are calculated, for each parameter, by a rule-based procedure (Kaschner et al., 2008) using percentiles of the values observed at the presence points. In some cases the trapezoid can be manually defined by a biologist. The probabilities are produced by multiplying the values of the functions for each relevant 0.5° cell in the oceans.
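As an illustration of the envelope calculation described above, the following minimal Python sketch evaluates one trapezoid per parameter and multiplies the results for a cell. It is not the AquaMaps implementation: the linear ramps between minimum and preferred minimum (and between preferred maximum and maximum) are an assumption about the trapezoid shape, the envelope values in `ENVELOPES` are invented for the example, and the percentile-based rule that derives the four values is not reproduced.

```python
from typing import Dict, Tuple

# (minimum, preferred minimum, preferred maximum, maximum) for one parameter
Envelope = Tuple[float, float, float, float]


def trapezoid(value: float, envelope: Envelope) -> float:
    """Return a suitability in [0, 1] for a single environmental parameter.

    Assumption: 0 outside [minimum, maximum], 1 inside the preferred range,
    and a linear ramp in between.
    """
    minimum, pref_min, pref_max, maximum = envelope
    if value < minimum or value > maximum:
        return 0.0
    if value < pref_min:
        return (value - minimum) / (pref_min - minimum)
    if value > pref_max:
        return (maximum - value) / (maximum - pref_max)
    return 1.0


def cell_probability(cell: Dict[str, float],
                     envelopes: Dict[str, Envelope]) -> float:
    """Multiply the per-parameter trapezoid values for one 0.5 degree cell."""
    probability = 1.0
    for name, envelope in envelopes.items():
        probability *= trapezoid(cell[name], envelope)
    return probability


# Hypothetical envelopes and cell values, for illustration only.
ENVELOPES: Dict[str, Envelope] = {
    "depth": (100.0, 150.0, 300.0, 700.0),
    "temperature": (10.0, 14.0, 20.0, 25.0),
}
example_cell = {"depth": 200.0, "temperature": 12.0}
print(cell_probability(example_cell, ENVELOPES))
```

Because the values are multiplied, a single parameter outside its envelope is enough to bring the probability of a cell to zero.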
The main difference between the Suitable and the Native algorithms is that, in the latter, the distributions are filtered and adjusted according to the expected bounding box of the species habitat and to the FAO ocean areas in which the species has been observed. AquaMaps adopts mechanistic assumptions combined with an automatic estimation of parameter values. After the model projection, a scientist can review a map by manually changing the trapezoidal curves or by modifying the values in the produced distribution table. AquaMaps is a reference algorithm for marine species distribution modelling, as it shows good performance compared to other purely automatic procedures (Corsi et al., 2000). The AquaMaps distribution was the base layer of our system. In the case of L. chalumnae, the AquaMaps project website (AquaMaps, 2013) distributes the map of the species in tabular and image format. The AquaMaps Native map available on the website has been approved by a biologist and reports the distribution off the south-eastern coast of Africa. This allows us to hypothesise good quality for the presence and absence data, at least in that area. This scenario meets the requirements indicated by the discussion in Section 2.1 on presence/absence points. In fact, we used the AquaMaps distribution to produce reliable absence points, as explained in Section 3. This allowed us to produce a presence/absence model. For this purpose we used Artificial Neural Networks (ANNs), which have been shown to achieve high accuracy in niche modelling (Pearson et al., 2002; Segurado and Araújo, 2004; Thuiller, 2003). ANNs implement a correlative approach, as they try to automatically simulate the probability of occurrence of a species given certain environmental conditions. Other models, like the cited GLMs and GAMs, but also SVMs (Drake and Randin, 2006), were possible alternative choices.
We chose ANNs because of our previous experience in using them in ENM (Coro et al., 2013), but also because we could rely on literature references to fine-tune the model and to keep overfitting under control (Özesmi et al., 2006; Scardi, 2001). In particular, we applied such techniques to account for our scenario, in which data quality was high but few samples were available. Section 3
Fig. 2. Representation of (a) the Latimeria chalumnae raw records stored in FishBase (Froese and Pauly, 2000) and of (b) the related AquaMaps Native distribution depicting the probabilistic actual distribution of the species.
outlines the method used to train an ANN on the basis of reliable pseudo-absence data produced from AquaMaps.
3. Method
In the work by Owens et al. (2012), presence records of the species were taken from the OBIS data source (Grassle, 2000) and other South African local reports (Scott, 2006). After quality control, only 8 presence points remained for L. chalumnae and 2 for L. menadoensis. Nevertheless, the models were trained using environmental features associated with a region that extended from the tip of the Indian Peninsula to the Cape of Good Hope. They integrated such information with presence points. In our work we used FishBase (Froese and Pauly, 2000), the largest and most consulted online database on adult fishes. Each species record contains information about taxonomy, geographical distribution, biometrics and morphology, behaviour and habitats, ecology and population dynamics, as well as reproductive, metabolic and genetic data. Currently, the repository includes 32,500 species, 299,600 common names, 52,200 pictures and 48,600 references. It has 1990 collaborators and records 700,000 visits per month. FishBase stores no presence points for L. menadoensis, but contains 34 points for L. chalumnae, of which 21 observations are suitable for an ecological niche modelling algorithm. The points are mainly located off the south-eastern coast of Africa. In the experiment presented here, we trained four models: two expert systems (AquaMaps Native and Suitable) that combine expert knowledge from biologists with a rule-based approach, and two other models based on one Feed-Forward Neural Network (NN Native and NN Suitable), which rely on a purely automatic (correlative) approach. The AquaMaps Native and NN Native algorithms model the actual distribution of the species, while the AquaMaps Suitable and NN Suitable algorithms address the potential distribution. The AquaMaps Native map was validated by a human expert (independently from this work) and published on the AquaMaps website (AquaMaps, 2013). This ensured a certain reliability of the model. Section 3.2 reports how we made the models collaborate, using the AquaMaps Native distribution to feed the Neural Network-based systems. Fig. 2 shows the distribution of the points in FishBase and the related actual distribution projected by the AquaMaps Native algorithm. The environmental features used in our models are extracted at those locations, and it is worth noting that FishBase does not record any presence point in Indonesia. The models used in Owens et al. (2012) were also trained using characteristics of the ocean close to Indonesia, which were not available to our models. In Section 4 we show that the proposed combination of the AquaMaps Native and the NN Suitable models produces a map for the potential distribution of L. chalumnae that is similar to those in Owens et al. (2012).
3.1. Pseudo-absence points generation
The novelty introduced with respect to the approach by Owens et al. (2012) is that we chose to use a presence/absence model and to combine simulated expert knowledge with a purely correlative algorithm. Feed-Forward Neural Networks also require absence data. We chose to produce absence data by taking low-probability points from the AquaMaps Native distribution, in particular those 0.5° cells with probability higher than 0 but lower than 0.2. We preferred low-probability points to 0-probability points because (i) the latter could bias the output of the Neural Network, owing to their huge number and to the many non-environmental factors that could have influenced the expert system at those points (Zaniewski et al., 2002), and because (ii) the former refer to the edges of the environmental envelope of the species (Mateo et al., 2010; Chefaoui and Lobo, 2008). This approach aims to enhance the quality and reliability of the training set, even though it yields only a few absence points. For the results presented in this paper we used our own implementations (Coro, 2011) of the AquaMaps algorithms, which are fully compliant with the original ones but add parallel processing facilities. The AquaMaps distributions we produced are identical to the validated distribution published on the AquaMaps website (AquaMaps, 2013). Low probabilities are then considered to indicate a weak relationship between the species and the environment. According to Brotons et al. (2004), such information should improve the performance of an automatic model. Fig. 3 shows the 7 simulated absence points resulting from the AquaMaps distribution according to the above criterion. Thus, we relied on the quality of the data to compensate for the lack of points. On the other hand, we took the AquaMaps distribution as the reference to test the performance of the trained model.
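The selection criterion above (cells with probability higher than 0 but lower than 0.2) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the distribution is represented as a dictionary from cell centre to probability, the function names are hypothetical, and the probabilities in `NATIVE` are invented rather than taken from the published map.

```python
from typing import Dict, List, Tuple

Cell = Tuple[float, float]  # (latitude, longitude) of a 0.5 degree cell centre


def select_pseudo_absences(distribution: Dict[Cell, float],
                           lower: float = 0.0,
                           upper: float = 0.2) -> List[Cell]:
    """Keep cells whose probability is strictly between lower and upper.

    Cells with probability exactly 0 are discarded, as in the text.
    """
    return sorted(cell for cell, p in distribution.items() if lower < p < upper)


def build_training_set(presences: List[Cell],
                       pseudo_absences: List[Cell]) -> List[Tuple[Cell, float]]:
    """Label presence cells with 1.0 and pseudo-absence cells with 0.0."""
    return ([(cell, 1.0) for cell in presences]
            + [(cell, 0.0) for cell in pseudo_absences])


# Invented probabilities, for illustration only.
NATIVE: Dict[Cell, float] = {
    (-12.25, 44.25): 0.95,   # high probability: not an absence
    (-26.75, 33.25): 0.15,   # low but non-zero: pseudo-absence
    (-30.25, 31.25): 0.05,   # low but non-zero: pseudo-absence
    (-20.25, 36.75): 0.20,   # on the threshold: excluded
    (1.75, 124.75): 0.0,     # zero probability: discarded
}
absences = select_pseudo_absences(NATIVE)
training_set = build_training_set([(-12.25, 44.25)], absences)
print(absences)
```

In the paper this criterion yields 7 pseudo-absence cells, which are combined with the 21 FishBase presence points.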
This approach is also suggested by Scardi (2001) for cases of limited training data: he recommends constraining training with theoretical knowledge and using a model that combines real data with predictions from another model.
3.2. Modelling
On the basis of the features extracted at the presence and simulated absence locations, we trained one Feed-Forward Neural Network. The method presented in Section 3.1 generates few absence points. In this situation the Neural Network has to minimize the prediction error on the training set but also to maximize the accuracy with respect to some test data (Özesmi et al., 2006; Boyce et al., 2002), which in our case were simulated by the AquaMaps
Fig. 3. Absence points, simulated taking locations in the AquaMaps Native distribution with probabilities higher than 0 and lower than 0.2.
Native distribution in the South African area. In order to produce a distribution map we implemented a procedure that altered the AquaMaps Native and Suitable algorithms. The AquaMaps algorithms consist of a probability calculation phase, followed by a set of post-processing rules which adjust the values according to the locations they refer to. This is necessary, for example, to check that a species is predicted within an expected bounding box. Thus, we built two distributions using the same Neural Network for the probability calculation. They differed in that the former maintained the post-processing procedures of the AquaMaps Native algorithm, while the latter maintained those of AquaMaps Suitable. We called the resulting algorithms NN Native and NN Suitable, respectively. The NN Native algorithm models the actual distribution of the species, while NN Suitable addresses the potential distribution. NN Suitable searches for areas with environmental characteristics similar to those in the presence set and far from those in the absence set. It is important to remark that we did not call the input environmental features into question. These come from the AquaMaps model and we chose not to alter them, in order to use exactly the same information as that model. Therefore, we chose not to use other assessment techniques for model selection, like BIC or AIC (Boyce et al., 2002), because these would have altered the set of environmental information suggested by the AquaMaps scientists (Kaschner et al., 2008). Thus, we relied on a strategy which only changed the topology of the network. We gave the same prior weight to every environmental feature passed as input to the network. We based the Neural Network model selection on the best-practice indications by Özesmi et al. (2006). We adopted the Least Prediction Error criterion and used a test distribution (AquaMaps Native) to assess the training quality.
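The idea of sharing one probability model between two sets of post-processing rules can be sketched as follows. This is an illustrative Python sketch, not the authors' code: `stand_in_network` is a placeholder for the trained network, the Native rule is reduced to a bounding-box filter (the FAO-area filter is omitted), the Suitable rule is assumed to leave the probability unchanged, and the bounding box and cells are invented.

```python
from typing import Callable, Dict, List, Tuple

Cell = Tuple[float, float]               # (latitude, longitude) of a cell centre
Box = Tuple[float, float, float, float]  # (south, north, west, east)
Rule = Callable[[Cell, float, Box], float]


def native_rule(cell: Cell, probability: float, box: Box) -> float:
    """Assumed Native-style rule: zero the probability outside the box."""
    south, north, west, east = box
    latitude, longitude = cell
    inside = south <= latitude <= north and west <= longitude <= east
    return probability if inside else 0.0


def suitable_rule(cell: Cell, probability: float, box: Box) -> float:
    """Assumed Suitable-style rule: no geographic filtering."""
    return probability


def project(features: Dict[Cell, List[float]],
            score: Callable[[List[float]], float],
            rule: Rule,
            box: Box) -> Dict[Cell, float]:
    """Score every cell with the same model, then apply one rule set."""
    return {cell: rule(cell, score(vector), box)
            for cell, vector in features.items()}


def stand_in_network(vector: List[float]) -> float:
    """Placeholder for the trained network; returns a fixed probability."""
    return 0.8


# Invented cells (10 features each) and bounding box, for illustration only.
FEATURES = {(-12.25, 44.25): [0.0] * 10, (1.75, 124.75): [0.0] * 10}
BOX: Box = (-35.0, 0.0, 25.0, 55.0)

nn_native = project(FEATURES, stand_in_network, native_rule, BOX)
nn_suitable = project(FEATURES, stand_in_network, suitable_rule, BOX)
print(nn_native, nn_suitable)
```

Under these assumptions the two maps agree inside the bounding box and differ only outside it, which is where a potential distribution can extend beyond the known range.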
Instead of using a test set to calculate accuracy while training the Neural Network (Boyce et al., 2002), we directly used the AquaMaps Native distribution and controlled its discrepancy with respect to the projection of the Network off the south-eastern coast of Africa, outside of the training set locations. Comparisons were done at 0.5° resolution, and we considered a location to be far from a training set point if they were separated by at least 1°. At the same time we minimized the prediction error on the training set by using the standard back-propagation algorithm (Bryson et al., 1979). The discrepancy with respect to AquaMaps Native was calculated by setting a 0.1 tolerance in the probabilities difference
and calculating the average probability discrepancy. The Feed-Forward Neural Network was trained using the same environmental information as in AquaMaps. The inputs were vectors of 10 real numbers reporting (for 0.5° ocean cells): the minimum, maximum and mean depth, and the mean annual values for salinity, bottom salinity, surface temperature, bottom temperature, primary production, distance from land and sea ice concentration. Thus, the network had 10 input neurons and 1 output neuron, returning real numbers ranging from 0 to 1. Hidden layers were necessary because the function to be simulated by the network was not linear. The training set was made up of 21 presence feature vectors and 7 simulated absence feature vectors. In the training session we changed the number of hidden layers and of neurons in each layer. We adopted a growing approach, in which we added neurons and layers as long as the error on the training set decreased. In order to avoid overfitting we also took into account the discrepancy with respect to AquaMaps Native: each change had to reduce the average probability difference outside of the training set. Finally, we took the best performing topology; we stopped either when the training error increased after a decrease or when the discrepancy with respect to AquaMaps outside the training set increased (Özesmi et al., 2006). The random initialization of the Neural Network weights required several runs. The best model had 2 hidden layers, with 100 neurons in the first layer and 2 neurons in the second. Using presences and simulated absences to train a Feed-Forward Neural Network is one possible way to combine real observations with the decisions taken by the AquaMaps expert systems. In Section 4 we show that, in the case of L. chalumnae, this combination allows the Neural Network to discover suitability also in Indonesia.
3.3. Effects of the usage of a low number of pseudo-absence points
In this section we comment on how the low number and the location of the absence points we took into account influence the expected performance. This is linked to the more general discussion about prevalence, defined as the ratio of the number of presences to the total number of data points used in building the model. Several studies have investigated the effects of prevalence on model accuracy. Jiménez-Valverde et al. (2009) simulated several prevalence scenarios for a virtual species. They found that prevalence is a property of the dataset more than of the species, and that it may affect those datasets storing poor information or referring to rare species. When a dataset does not contain a sufficient number of samples, it cannot ensure that model performance is independent of the sample size. On the other hand, Jiménez-Valverde et al. (2009) advise using the best absence data possible when modelling the distribution of species over narrow geographic ranges, in the case of biased prevalence or rare species. In this last case, the common strategy of resampling the data to obtain training data with a prevalence of 0.5 should be discarded. In our scenario, prevalence is 0.75 (21 presences out of 28 training points) and the sample size is small, which should result in a general overprediction if the quality of the absence data were poor (Jiménez-Valverde et al., 2009). This is not the case when we compare the results obtained by means of the Neural Network with the AquaMaps distributions. Other works, in fact, support the view that in the absence of noise the effect of prevalence is not so influential (Cramer, 1999). The locations of the absence points relied completely on the AquaMaps Native distribution values. The effects of this choice can be evaluated by looking at recent studies about the sampling of pseudo-absence observations. Stokland et al.
(2011) studied alternative ways of generating pseudo-absence data and how these affected the resulting models. They observed that a change in the locations from which pseudo-absences are taken affected performance more than a change in their number. According to them, a good practice is to take the pseudo-absence observations within the geographical or environmental range where a species is known
to occur. Removing absence observations in the zone of presence observations has the effect that the probability of occurrence becomes artificially high. Similar advice comes from VanDerWal et al. (2009), who studied the importance of the size of the area from which pseudo-absences are taken and of how far these should be from presence locations. In particular, the performance of their models changed according to whether the pseudo-absence points were taken randomly or following a fixed scheme. This behaviour was also reflected in the weights given to the input environmental variables in the generated model. VanDerWal et al. (2009) indicated that model performance is higher if pseudo-absence points are taken from a region which is neither too restricted nor too broad. Mateo et al. (2010) advised that performance may benefit from taking absences in sites where related species have been collected, but not the species being modelled. With a method similar to that of Brotons et al. (2004), they detected an improvement in performance if the pseudo-absences correctly represented areas that were unsuitable for the species. The number of errors could strongly increase if false absences were included. Chefaoui and Lobo (2008) explored the effects of introducing unreliable absence data and advised taking absences from an environment clearly separated from the best representation of presence data. The most common effect of unreliable absences would be over-prediction by the model. Thus, they took absences distant from presence points in the environmental representation space rather than in the geographical space. In this context, our choice is justified because it aims to meet the above requirements. In fact, the AquaMaps Native algorithm introduces some constraints on the probability values according to the location they refer to.
The non-zero probability locations must lie in an expected bounding box for the species, where the corresponding half-degree cells are called good cells and represent suitable environmental locations for the species. This scenario complies with the indications of Stokland et al. (2011) and VanDerWal et al. (2009), who recommend having absences near geographic presences. Furthermore, our approach associated pseudo-absence points with low probability locations. Referring to the envelopes of the environmental parameters associated with the presence points, this means that our pseudo-absence points lie at the edge of the environmental boundaries of the species, which is similar to what both Mateo et al. (2010) and Chefaoui and Lobo (2008) suggest. Finally, the fact that we rely on a map validated by an expert should account for compliance with the quality requirements of all the cited works.

3.4. Features representativeness

Feature analysis is crucial in ENM. A preliminary processing of the feature vectors constituting the training set could highlight useless features or could evaluate the potential robustness of the models to be produced. One of the best known techniques is Principal Component Analysis (PCA) (Jolliffe, 2005), a mathematical procedure that aims to reduce the dimensionality of the feature space. PCA applies an orthogonal transformation to the feature space to produce linearly uncorrelated variables called principal components. This transformation can be useful for investigating the correlations among the environmental features used in ENM. Adding more dependent variables, in fact, usually does not result in better models. PCA is not specific to biological applications, and other topic-oriented transformations can rely on it. The Habitat Representativeness Score (HRS) (MacLeod, 2010) is an algorithm based on PCA applicable to marine species environmental features.
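As an illustration of the PCA step mentioned above, the following minimal sketch computes principal components with NumPy through a singular value decomposition of the standardised feature matrix. It is not the implementation used in this work: the function name `pca`, the standardisation choice and the toy features (`depth`, `salinity`, `temperature`) are assumptions made only for the example.

```python
import numpy as np

def pca(features):
    """PCA of a 2-D array (one row per cell, one column per feature).

    Returns the principal axes, the fraction of variance explained by
    each component, and the data projected on the components.
    """
    X = np.asarray(features, dtype=float)
    # Standardise each feature, since environmental variables have different units.
    std = X.std(axis=0, ddof=1)
    std[std == 0] = 1.0  # avoid division by zero for constant features
    Z = (X - X.mean(axis=0)) / std
    # The rows of Vt are the orthogonal principal axes.
    _, S, Vt = np.linalg.svd(Z, full_matrices=False)
    explained = S**2 / np.sum(S**2)
    scores = Z @ Vt.T
    return Vt, explained, scores

# Hypothetical toy data: temperature is almost a linear function of depth,
# so two components should carry nearly all the variance.
rng = np.random.default_rng(0)
depth = rng.normal(size=200)
salinity = rng.normal(size=200)
temperature = -0.9 * depth + 0.1 * rng.normal(size=200)
X = np.column_stack([depth, salinity, temperature])

components, explained, scores = pca(X)
print(np.round(explained, 3))
```

A very small explained-variance fraction for the last component signals that one of the input features is nearly redundant, which is the kind of dependency among environmental features that the text refers to.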
It measures the degree to which sampled habitats are representative of a certain study area. HRS has been used for assessing the minimum number of surveys of a study area that are needed to cover a good heterogeneity of species habitat variables. HRS can be applied to two datasets of environmental features, one representing a sampled area and the other a geographical region of interest. A score is produced for each feature, ranging from 0 to 2, with 2 representing completely non-overlapping distributions of values. The lower the HRS, the more similar the data obtained from a survey are to the study area. In this paper we use HRS for assessing how well the features associated with species occurrence points represent a projection area. Consequently, this gives an indication of how reliably the model is trained on those features.

4. Results

We calculated the overall distance between the AquaMaps Native and NN Native distributions and between the AquaMaps Suitable and NN Suitable distributions. Fig. 4 compares the AquaMaps Native and NN Native distributions, while Fig. 5 compares the AquaMaps Suitable and the NN Suitable distributions. A discrepancy is evident for the Suitable algorithms in the Indonesia area where we know, from Owens et al. (2012), that the species can be present. We used a tabular representation of the maps in which a probability value was reported for each 0.5° cell. Table 1 shows the discrepancies between the maps, according to a 0.1 tolerance in the probability differences. These discrepancy calculations provide a numerical evaluation of the distance between the maps. The Native maps are more similar to each other than the Suitable ones, which is an effect of the training method described in Section 3.2.

Table 1
Cell-to-cell comparison between the AquaMaps Native and the NN Native distributions and between the AquaMaps Suitable and the NN Suitable distributions.

                                             AquaMaps Native      AquaMaps Suitable
                                             vs. NN Native        vs. NN Suitable
Probability tolerance                        0.1                  0.1
Mean discrepancy                             0.75                 0.58
Discrepancy variance                         0.05                 0.09
Maximum probability difference               1.0                  1.0
Maximum probability difference cell centre   −8.75°, 39.75°       6.75°, 2.25°
Percentage of discrepant values              33.8%                51.1%

Furthermore, we calculated the performances of the AquaMaps and NN Suitable models on the presence and absence points, according to standard measurements indicated in Pearson (2012) and in Boyce et al. (2002). On the one hand, this calculates how distant the AquaMaps Suitable distribution is from the AquaMaps Native distribution, as the latter has full agreement with the training set. On the other hand, this measures how well the NN Suitable is able to fit the training set, which is not trivial if there is noise in the data.
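The threshold-based measurements mentioned above can be sketched as follows. This is a minimal example under stated assumptions, not the evaluation code used for this work: the function name `presence_absence_metrics`, the 0.5 threshold and the toy probabilities are hypothetical, and only accuracy, sensitivity and omission rate are computed.

```python
def presence_absence_metrics(probabilities, labels, threshold=0.5):
    """Accuracy, sensitivity and omission rate of a thresholded probability map.

    probabilities: predicted probability for each evaluation point.
    labels: True for a presence point, False for a (simulated) absence point.
    threshold: probability at or above which a point counts as predicted
               presence (0.5 is an illustrative default, not a value from the paper).
    """
    tp = fp = tn = fn = 0
    for p, is_presence in zip(probabilities, labels):
        predicted_presence = p >= threshold
        if is_presence and predicted_presence:
            tp += 1
        elif is_presence:
            fn += 1
        elif predicted_presence:
            fp += 1
        else:
            tn += 1
    total = tp + fp + tn + fn
    if total == 0 or tp + fn == 0:
        raise ValueError("at least one presence point is required")
    return {
        "accuracy": (tp + tn) / total,       # fraction of points classified correctly
        "sensitivity": tp / (tp + fn),       # fraction of presences predicted as presence
        "omission_rate": fn / (tp + fn),     # fraction of presences missed
    }

# Hypothetical toy evaluation set: four presences followed by four absences.
probabilities = [0.9, 0.8, 0.3, 0.7, 0.2, 0.1, 0.6, 0.05]
labels = [True, True, True, True, False, False, False, False]
metrics = presence_absence_metrics(probabilities, labels)
print(metrics)
```

By construction the omission rate is the complement of the sensitivity, which matches the pairs of values reported in Table 2 up to rounding.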
The performances are reported in Table 2. They indicate a good adaptation by the Neural Network to the training set and a low average agreement between the AquaMaps Suitable and Native distributions.

Table 2
Performances of the AquaMaps Suitable distribution and of the Neural Network Suitable distribution on both the presences and simulated absences sets.

                          AquaMaps Suitable    Neural Network Suitable
Accuracy                  49%                  95%
Sensitivity               38%                  94%
Omission rate             62%                  5.9%
AUC                       0.41                 ∼1
ROC optimal threshold     0.17                 ∼0

Fig. 4. Comparison between the (a) AquaMaps Native and the (b) NN Native distributions.

Fig. 5. Comparison between the (a) AquaMaps Suitable and the (b) NN Suitable distributions. An area in Indonesia is highlighted to indicate disagreement between the maps.

As also reported in Table 2, the largest discrepancy between the Suitable maps exists in Indonesia. Looking at the maps in Fig. 6, it can be noticed that Indonesia is indeed a suitable area also for the GARP and MaxEnt models, which is also confirmed by experts’ opinion, as reported by
Owens et al. (2012). We then selected an area in Indonesia defined by the bounding box (95.34°, −9.19°; 125.67°, 12.98°). We calculated the Habitat Representativeness Scores of the training set for this box, with and without the simulated absence points. The score for the set containing simulated absences and presences was 10.58, while the score for the presences-only set was 10.61. Fig. 7 graphically represents the operation. According to the recommendations in MacLeod (2010), the score is too high in both cases. This means that the whole set of training features is not representative of the Indonesia area, and we cannot expect reliable Suitable distributions. On the other hand, since we know from Owens et al. (2012) that the area is really suitable for L. chalumnae, we can argue that the Neural Network produces a better quality distribution. The details of the HRS for each environmental feature are reported in Table 3. They show that some of the features are indeed well represented. In particular, the minimum depth in the 0.5° cell, mean annual bottom salinity, primary production and distance from land have similar variations with respect to the training set. The use of the simulated absences set does not substantially change the representativeness scores. This means that the variations in the absence features are also well represented in Indonesia. We can argue that the NN Suitable algorithm has identified the importance of such elements, something which is not possible with the AquaMaps Suitable algorithm. This is due to the abstraction power of the Neural Network but also to the use of the simulated absence points.

Table 3
Detailed Habitat Representativeness Scores on Indonesia, first considering the simulated absences and presences sets and then considering presences sets only.

Features                            HRS on Ab. and Pr.    HRS on Pr.
Mean depth                          1.90                  1.92
Maximum depth                       0.87                  0.86
Minimum depth                       0.04                  0.04
Mean annual surface temperature     1.19                  1.13
Mean annual bottom temperature      1.59                  1.56
Mean annual salinity                1.23                  1.29
Mean annual bottom salinity         0.44                  0.34
Mean annual primary production      0.61                  0.64
Mean annual ice concentration       0.71                  0.78
Distance from land                  0.46                  0.49
Ocean area in the cell              1.54                  1.55
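To make the 0 to 2 range of the per-feature scores concrete, the sketch below computes the summed absolute difference between the relative-frequency histograms of a sampled set and of a study area. This is a simplified stand-in and an assumption of the example: the HRS of MacLeod (2010) is based on PCA, whereas the hypothetical function `overlap_score` compares raw feature values with an arbitrary number of bins. The last lines only sum the "HRS on Ab. and Pr." column of Table 3, which reproduces the overall score of 10.58 quoted in the text.

```python
import numpy as np

def overlap_score(sample_values, area_values, bins=10):
    """Summed absolute difference between two relative-frequency histograms.

    Returns 0 for identical distributions and 2 for distributions that
    share no histogram bin (illustrative simplification, no PCA step).
    """
    sample = np.asarray(sample_values, dtype=float)
    area = np.asarray(area_values, dtype=float)
    lo = min(sample.min(), area.min())
    hi = max(sample.max(), area.max())
    if lo == hi:
        return 0.0  # all values identical in both sets
    edges = np.linspace(lo, hi, bins + 1)  # common bins for both sets
    sample_freq = np.histogram(sample, bins=edges)[0] / sample.size
    area_freq = np.histogram(area, bins=edges)[0] / area.size
    return float(np.abs(sample_freq - area_freq).sum())

print(overlap_score([1, 2, 3, 4], [1, 2, 3, 4]))   # identical sets
print(overlap_score([0, 1, 2], [10, 11, 12]))      # disjoint sets

# Per-feature scores from the "HRS on Ab. and Pr." column of Table 3.
hrs_absences_and_presences = [1.90, 0.87, 0.04, 1.19, 1.59, 1.23,
                              0.44, 0.61, 0.71, 0.46, 1.54]
total_hrs = sum(hrs_absences_and_presences)
print(round(total_hrs, 2))
```

With eleven features each bounded by 2, the overall score can range from 0 to 22, which gives a scale against which the reported values can be read.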
Fig. 6. (1) Actual and (2) potential distributions from Owens et al. (2012). The upper charts report the GARP distributions, while the lower charts report the MaxEnt distributions.
Fig. 7. Graphical representation of the Habitat Representativeness Scores between the training sets and an area in Indonesia.
5. Conclusions

We have presented a technique that can help in conservation strategies for L. chalumnae, but could also be applied to other species. A model using presence/absence information can be suitable if the quality of the features is high, as discussed in Section 3.3. Thus, we have used the output of an expert system to simulate the opinion of a scientist on unsuitable locations for the species. We have validated the classification power of the model on L. chalumnae, taking the work of Owens et al. (2012) as reference. The case is interesting from the biodiversity conservation and evolutionary biology points of view. We have proposed a possible combination of an expert system and a Feed-Forward Neural Network, in which the former is used to produce pseudo-absence data that feed the latter. The projection of the combined model is distant from the map generated by the expert system alone. In particular, a disagreement is evident in the Indonesian area, which does not have any associated observations in the training set. We have performed a numerical comparison between the maps in order to estimate their distance. Furthermore, we have evaluated the representativeness of the training set with respect to the disagreement area. This evaluation demonstrates that it is not possible to state a priori that the expert system is more reliable than the Neural Network. On the other hand, by comparing the distributions with those reported in Owens et al. (2012), we have shown that more reliability has to be assigned to the Neural Network. In fact, it agrees with the indications of two other models trained on areas closer to Indonesia. Furthermore, the suitability of Indonesian waters for
L. chalumnae is also supported by experts’ considerations. The presented method can be applied to other species with few observations. It is currently used by the i-Marine project (i-Marine, 2011) open platform (Gioia and Coro, 2012), which allows scientists to share their datasets and methods and to combine several techniques in transparent and user-friendly collaborative environments. Future work will concentrate on the application of the method to other cases, which will require validation by experts from the i-Marine community. The aim is to understand whether the behaviour detected for the combined model in the case of L. chalumnae is also valid for other species.

Acknowledgments

The reported work has been partially supported by the i-Marine project (FP7 of the European Commission, INFRASTRUCTURES2011-2, Contract No. 283644).

Appendix A. Supplementary Data

Supplementary data associated with this article can be found, in the online version, at http://dx.doi.org/10.1016/j.ecolmodel.2013.08.005.

References

AquaMaps, 2013. The AquaMaps Project Website. http://www.aquamaps.org
Berger, A.L., Pietra, V.J.D., Pietra, S.A.D., 1996. A maximum entropy approach to natural language processing. Computational Linguistics 22 (1), 39–71.
Boyce, M.S., Vernier, P.R., Nielsen, S.E., Schmiegelow, F.K., 2002. Evaluating resource selection functions. Ecological Modelling 157, http://dx.doi.org/10.1016/s0304-3800(02)00200-4.
Brotons, L., Thuiller, W., Araújo, M.B., Hirzel, A.H., 2004. Presence-absence versus presence-only modelling methods for predicting bird habitat suitability. Ecography 27 (4), 437–448, http://dx.doi.org/10.1111/j.0906-7590.2004.03764.x.
Bryson, A.E., Ho, Y.-C., Siouris, G.M., 1979. Applied optimal control: optimization, estimation, and control. IEEE Transactions on Systems, Man and Cybernetics 9 (6), 366–367.
Chefaoui, R.M., Lobo, J.M., 2008. Assessing the effects of pseudo-absences on predictive distribution model performance. Ecological Modelling 210 (4), 478–486.
Chuine, I., Beaubien, E., 2008. Phenology is a major determinant of tree species range. Ecology Letters 4 (5), 500–510.
Coro, G., 2011. Ecological Modelling Library for Gcube vre. Software, [Software] Release 1.0.0, 18 May 2011.
Coro, G., Pagano, P., Ellenbroek, A., 2013. Automatic procedures to assist in manual review of marine species distribution maps. In: Tomassini, M., Antonioni, A., Daolio, F., Buesser, P. (Eds.), Adaptive and Natural Computing Algorithms. Vol. 7824 of Lecture Notes in Computer Science. Springer, Berlin Heidelberg, pp. 346–355, http://dx.doi.org/10.1007/978-3-642-37213-1_36.
Corsi, F., de Leeuw, J., Skidmore, A., 2000. Modeling species distribution with gis. In: Research Techniques in Animal Ecology. Columbia University Press, New York, pp. 389–434.
Cramer, J.S., 1999. Predictive performance of the binary logit model in unbalanced samples. Journal of the Royal Statistical Society: Series D (The Statistician) 48 (1), 85–94, http://dx.doi.org/10.1111/1467-9884.00173.
de Souza Muñoz, M.E., De Giovanni, R., de Siqueira, M.F., Sutton, T., Brewer, P., Pereira, R.S., Canhos, D.A.L., Canhos, V.P., 2011. Openmodeller: a generic approach to species’ potential distribution modelling. GeoInformatica 15 (1), 111–135.
Drake, J.M., Randin, C., 2006. Modelling ecological niches with support vector machines. Journal of Applied Ecology 43 (3), 424–432.
Elith, J., Leathwick, J., 2009. Species distribution models: ecological explanation and prediction across space and time. Annual Review of Ecology, Evolution, and Systematics 40, 677–697.
Erdmann, M., Caldwell, R., Moosa, M.K., 1998. Indonesian king of the sea discovered. Nature 395, 335.
Ferrier, S., Watson, G., 2007. An Evaluation of the Effectiveness of Environmental Surrogates and Modeling Techniques in Predicting the Distribution of Biological Diversity. NSW National Parks and Wildlife Service.
Fricke, H., 1997. Living coelacanths: values, eco-ethics and human responsibility. Marine Ecology Progress Series 161, 1–15.
Froese, R., Pauly, D., 2000. FishBase: a global information system on fishes. FishBase.
Gioia, A., Coro, G., 2012. Statistical Manager Service. Software, [Software] Release 1.1.0, 12 December 2012.
Grassle, J.F., 2000. The ocean biogeographic information system (obis): an on-line, worldwide atlas for accessing, modeling and mapping marine biological data in a multidimensional geographic context. Oceanography 13 (3).
Guisan, A., Zimmermann, N.E., 2000. Predictive habitat distribution models in ecology. Ecological Modelling 135 (2–3), 147–186.
Hirzel, A., Hausser, J., Chessel, D., Perrin, N., 2002a. Ecological-niche factor analysis: how to compute habitat-suitability maps without absence data? Ecology 83 (7), 2027–2036.
Hirzel, A., Hausser, J., Perrin, N., 2002. Biomapper 2.0.
Hirzel, A., Helfer, V., Metral, F., 2001. Assessing habitat-suitability models with a virtual species. Ecological Modelling 145 (23), 111–121, http://www.sciencedirect.com/science/article/pii/S0304380001003969
i-Marine, 2011. i-Marine European Project. http://www.i-marine.eu
Jiménez-Valverde, A., Lobo, J.M., Hortal, J., 2009. The effect of prevalence and its interaction with sample size on the reliability of species distribution models. Community Ecology 10, http://dx.doi.org/10.1556/ComEc.10.2009.2.9.
Jolliffe, I., 2005. Principal Component Analysis. Wiley Online Library.
Kamino, L., Stehmann, J., Amaral, S., De Marco, P., Rangel, T., de Siqueira, M., De Giovanni, R., Hortal, J., 2012. Challenges and perspectives for species distribution modelling in the neotropics. Biology Letters 8 (3), 324–326.
Kaschner, K., Ready, J.S., Agbayani, E., Rius, J., Kesner-Reyes, K., Eastwood, P.D., South, A.B., Kullander, S.O., Rees, T., Close, C.H., Watson, R., Pauly, D., Froese, R., 2008. AquaMaps: Predicted Range Maps for Aquatic Species. http://www.aquamaps.org/
Kaschner, K., Watson, R., Trites, A.W., Pauly, D., July 2006. Mapping world-wide distributions of marine mammal species using a relative environmental suitability (RES) model. Marine Ecology Progress Series 316, 285–310.
MacLeod, C., 2010. Habitat representativeness Score (hrs): a novel concept for objectively assessing the suitability of survey coverage for modelling the distribution of marine species. Journal of the Marine Biological Association of the United Kingdom 90 (07), 1269–1277.
Mateo, R.G., Croat, T.B., Felicísimo, Á.M., Muñoz, J., 2010. Profile or group discriminative techniques? generating reliable species distribution models using pseudo-absences and target-group absences from natural history collections. Diversity and Distributions 16 (1), 84–94.
Northcutt, R., Bemis, W., 1993. Cranial nerves of the coelacanth, Latimeria chalumnae [osteichthyes: Sarcopterygii: Actinistia], and comparisons with other craniata. Brain, Behavior and Evolution 42 (Suppl. 1), 1–76.
Owens, H., Bentley, A., Peterson, A., 2012. Predicting suitable environments and potential occurrences for coelacanths (Latimeria spp.). Biodiversity and Conservation 21, 577–587, http://dx.doi.org/10.1007/s10531-011-0202-1.
Özesmi, S.L., Tan, C.O., Özesmi, U., 2006. Methodological issues in building, training, and testing artificial neural networks in ecological applications. Ecological Modelling 195 (1), 83–93.
Pearson, R., Dawson, T., Berry, P., Harrison, P., 2002. Species: a spatial evaluation of climate impact on the envelope of species. Ecological Modelling 154 (3), 289–300, http://www.sciencedirect.com/science/article/pii/S030438000200056X
Pearson, R.G., 2012. Species distribution modeling for conservation educators and practitioners. Synthesis. American Museum of Natural History. Available at: http://ncep.amnh.org
Peterson, A., Soberon, J., Pearson, R., Anderson, R., Martinez-Meyer, E., Nakamura, M., Araujo, M., 2011. Ecological Niches and Geographic Distributions (MPB-49). Vol. 49. Princeton University Press, New Jersey, USA.
Ready, J., Kaschner, K., South, A.B., Eastwood, P.D., Rees, T., Rius, J., Agbayani, E., Kullander, S., Froese, R., 2010. Predicting the distributions of marine organisms at the global scale. Ecological Modelling 221 (3), 467–478, http://www.sciencedirect.com/science/article/pii/S030438000900711X
Scardi, M., 2001. Advances in neural network modeling of phytoplankton primary production. Ecological Modelling 146 (1), 33–45.
Scott, L.E., 2006. Atlas of southern African freshwater fishes. Vol. 2. The South African Institute for Aquatic Biodiversity.
Segurado, P., Araújo, M.B., 2004. An evaluation of methods for modelling species distributions. Journal of Biogeography 31 (10), 1555–1568, http://dx.doi.org/10.1111/j.1365-2699.2004.01076.x.
Smith, J.L.B., 1939. A living fish of the mesozoic type. Nature 143, 455–456.
Stockwell, D., 1999. The garp modelling system: problems and solutions to automated spatial prediction. International Journal of Geographical Information Science 13 (2), 143–158.
Stokland, J.N., Halvorsen, R., Støa, B., 2011. Species distribution modelling - effect of design and sample size of pseudo-absence observations. Ecological Modelling 222, http://dx.doi.org/10.1016/j.ecolmodel.2011.02.025.
Thuiller, W., 2003. Biomod optimizing predictions of species distributions and projecting potential future shifts under global change. Global Change Biology 9 (10), 1353–1362, http://dx.doi.org/10.1046/j.1365-2486.2003.00666.x.
VanDerWal, J., Shoo, L.P., Graham, C., Williams, S.E., 2009. Selecting pseudo-absence data for presence-only distribution modeling: how far should you stray from what you know? Ecological Modelling 220, http://dx.doi.org/10.1016/j.ecolmodel.2008.11.010.
Zaniewski, A., Lehmann, A., Overton, J.M., 2002. Predicting species spatial distributions using presence-only data: a case study of native new zealand ferns. Ecological Modelling 157 (23), 261–280, http://www.sciencedirect.com/science/article/pii/S0304380002001990