Assessing spatial distribution of Coffea arabica L. in Ethiopia's highlands using species distribution models and geospatial analysis methods

Assessing spatial distribution of Coffea arabica L. in Ethiopia's highlands using species distribution models and geospatial analysis methods

Accepted Manuscript Assessing spatial distribution of Coffea arabica L. in Ethiopia's highlands using species distribution models and geospatial analy...

3MB Sizes 1 Downloads 45 Views

Accepted Manuscript Assessing spatial distribution of Coffea arabica L. in Ethiopia's highlands using species distribution models and geospatial analysis methods

Binyam Tesfaw Hailu, Mika Silijander, Eduardo E. Maeda, Petri Pellikka PII: DOI: Reference:

S1574-9541(16)30082-6 doi:10.1016/j.ecoinf.2017.10.001 ECOINF 803

To appear in:

Ecological Informatics

Received date: Revised date: Accepted date:

6 July 2016 2 October 2017 4 October 2017

Please cite this article as: Binyam Tesfaw Hailu, Mika Silijander, Eduardo E. Maeda, Petri Pellikka , Assessing spatial distribution of Coffea arabica L. in Ethiopia's highlands using species distribution models and geospatial analysis methods. The address for the corresponding author was captured as affiliation for all authors. Please check if appropriate. Ecoinf(2017), doi:10.1016/j.ecoinf.2017.10.001

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

ACCEPTED MANUSCRIPT Assessing spatial distribution of Coffea arabica L. in Ethiopia’s highlands using species distribution models and geospatial analysis methods

Corresponding Author

T

1. Binyam Tesfaw Hailu [email protected]; School of Earth Sciences, Addis Ababa University

CR

IP

2. Mika Silijander [email protected]; Department of Geosciences and Geography, University of Helsinki 3. Eduardo E. Maeda [email protected] ; Department of Environmental Sciences, University of Helsinki

AN

US

4. Petri Pellikka [email protected]; Department of Geosciences and Geography, University of Helsinki. Best regards, Binyam

Abstract

AC

CE

PT

ED

M

Though there is an increase in popularity of predictive modelling for assessing the geographical distribution of species, there is still a clear gap on explaining geospatial methods to derive the presence/absence of species in terms of geospatial extent besides the ambiguity of robust models. In this paper, we evaluate four major species distribution modelling methods: Artificial Neural Network (ANN), Support Vector Machines (SVM), Maximum Entropy (MaxEnt) and Generalized Linear Model (GLM) with pseudo absence and background absence data. To investigate the efficacy of these models, we present a case study using Coffea arabica L. species in Ethiopia as there was no species distribution modelling that has been done at a local scale especially in the coffee growing areas. We made predictions on 75% subsets and validation on 25% of the 112 presence of the species records that were collected from field observation and 0.5 meter spatial resolution of true colour aerial photographs. Twelve biophysical explanatory variables; climatic, remote sensing based and landscape variables were employed in modelling. The results show that MaxEnt with pseudo absence data and SVM with background absence have highest area of understory coffee presence prediction with 12.2 % and 23.1% area coverage of indigenous forest, respectively. The result from the model performance test using True Positive Rate (TPR) shows that GLM and SVM with pseudo absence data performed highest (TPR=0.821). MaxEnt and SVM were the robust modelling methods (TPR=0.964) using background absence data. .

1

ACCEPTED MANUSCRIPT 1. Introduction

ED

M

AN

US

CR

IP

T

Understanding the spatial distribution of any given species has been the focus of scientific research for centuries. Early studies such as those conducted by Von Humboldt and Bonpland (1805) attempted to give details on the reasons for how some species are present in some places and absent in others. Since mid-1980s, a new concept of quantitative species spatial distribution modelling (SSDM) has emerged, which uses geographic information systems (GIS) and remote sensing along with the increase of advanced ecological applications. Currently, vast numbers of applications exist at the local, regional, and global scales (Guisan and Zimmermann, 2000; Rushton et al., 2004; Elith, J., 2006). SDMs, also known as Ecological Niche Models, are now widely used to /indicate/ascertain/ the ecological requirements of species and also to predict their geographical distributions (Elith and Leathwick, 2009) SDMs are also used for the evaluation of mapping the spread of invasive species (Shatz et al., 2013). For example, one of this class of environmental (or ecological) niche model, the Maximum Entropy (MaxEnt) model, predicts the relative suitability of species habitat The MaxEnt model does this by using a correlation approach that evaluates the environmental conditions that satisfy a species ecological requirements (Warren and Seifert, 2011; Chemura et al., 2015). The assumption of niche theory is that uni-modal curves with symmetric Gaussian-shaped and close connection with the continuum concept when it is applied to plants (Austin and Smith, 1989). Recent evidence, Austin (2005), supports the incidence of unimodal response curves with diverse skewed asymmetric or symmetric shapes for mapping plant species. Comparisons of methods seldom use parameterization such as multiple linear versus curvilinear terms, or an ordinary type of data (presence/absence or counts), or predictors. Consequently, identifying the best SDM methodology for specific cases is often subjective.

AC

CE

PT

Coffee is one of the most valuable traded commodities and it has been the second most traded commodity after oil in the post-World War II period (Davis et al., 2012; Ponte, 2002). In 2014, coffee accounted for an estimated US$ 13.9 billion of exports as some 5 billion kg were shipped and about 26 million people in 52 producing countries were engaged in total coffee sector employment (ICO, 2015). Coffea arabica (C. arabica) originally comes from the highlands of Ethiopia (Davis et al., 2012; Teketay, 1999). Specifically, highlands of south-western Ethiopia has C. arabica with most diverse populations (Gole et al., 2008). Coffee is the backbone for the economy of Ethiopia as it contributes 41% of total foreign exchange earnings in 2005 (IMF, 2006). C. arabica grows as a wild native understory shrub in the forests of Ethiopia’s highlands (Gole et al., 2008; Hernandez-Martinez et al., 2009). Ethiopia is the world’s third largest producer/exporter of C. arabica. It is home to at least 95% of the genetic resources for the species (Davis et al., 2012). The habitat of wild coffee is natural forest and the site researched in this study lies within the Eastern Afromontane Biodiversity Hotspot that is known for plant diversity and its large numbers of endemic species internationally. This region is also known for the imminent threat of habitat destruction (Gil et al., 2004). The montane rainforest with wild coffee occupies a 2

ACCEPTED MANUSCRIPT majority of the production area whereas plantation of coffee amount to only 6% of the total coffee production area in Ethiopia (Teketay, 1999). Local farmers either manage the wild coffee stands clearing some canopy trees and the competing undergrowth vegetation or harvest wild coffee fruits inside the natural forests. Coffee plants contribute to ecosystem processes provided by these forests that include habitat provisioning for a diverse wildlife community, soil conservation, and regulation of climate and atmospheric fluxes in carbon dioxide.

AN

US

CR

IP

T

Recent studies on Coffea arabica species distribution modelling used the MaxEnt method on a regional scale (Warren and Seifert, 2011; Davis et al., 2012). However, there seems to be a lack of studies about the Coffea arabica species on the local geographical scale and to the best of our knowledge there have been no published studies on this species that utilizes the other Species Distribution Model (SDM) modalities. The SDMs were used in this research to determine the Coffea arabica species distribution in Southwestern highlands of Ethiopia including MaxEnt. We evaluated the predictive capacity of SDMs for estimating the presence/absence of Coffea arabica species. The models were parameterized using geospatial variables i.e. the climatic variables were precipitation, minimum temperature, maximum temperature, and evapotranspiration. The remote sensing variables were: (Normalized Difference Vegetation Index (NDVI), Simple Ratio (SR), Shadow Fraction. and landscape variables: (distance to roads, distance to river, Digital Elevation Model (DEM), slope).

PT

ED

M

Mapping understorey coffee is very challenging because any cultivation that takes place under the canopies of scattered exotic trees as understory cultivations. First, we use object oriented classification of satellite image to map indigenous forests typically associated with understorey coffee shrubs. Second, we used four major predictive modelling methods: Artificial Neural Network (ANN), Support Vector Machines (SVM), Maximum Entropy (MaxEnt) and Generalized Linear Model (GLM) to associate the understorey coffee to forests and to the environmental covariates. Finally, we evaluate the efffectiveness of these models.

CE

2. Study Area

AC

The study area is located between latitude 7.95° and 8.08° North and longitude 36.3° to 36.5°East in upstream of Didessa river basin, which is located in south western Ethiopia and is a tributary of the Blue Nile River. The extent of the study area was 19 100 ha (Fig. 1). It has an altitude that varies between 1400 meters above sea level (m.a.s.l.) in the Didessa River downstream to 2400 m.a.s.l. in the upstream section. The topography is rugged with slopes between 0–50o. The population density is between 0.49 and 6.87 per square kilometre area. Mean temperatures over this area ranges between 17.5 and 20.5 at the lowest and highest altitudes, respectively. Rainfall in the area ranges from 144 mm/month in the downstream of the Didessa River to 161 mm/month in the natural forest. The study area is covered by natural forests, exotic forest plantations, agriculture areas and pastures. The false-colour SPOT 5 satellite image in Fig. 1 shows that the study area is mainly covered by forest (red colour), agricultural areas (light blue colour) and pasture (pink colour) land covers. 3

M

AN

US

CR

IP

T

ACCEPTED MANUSCRIPT

PT

3. Geospatial datasets

ED

Fig 1. Study area map: Ethiopian highlands (green colour) at the bottom left, SPOT5 satellite image scene (Path/Row of 134/133) at the bottom centre and the upper most map is the SPOT 5 image showing the Didessa valley and land cover.

3.1 Coffea arabica presence data

AC

CE

Coffea arabica presence data were used as a response variable in the SDM models. These data were collected from field observations in a coffee growing indigenous forest cover using hand held Global Positioning System (GPS) device with an accuracy of ±5-7m, and colour aerial photographs. The samples were taken randomly from indigenous forest areas, which have understory coffee. Coffea arabica data were randomly split into calibration (75%) and evaluation (25%) data sets that follow the so called “split-sample approach” suggested by Guisan and Zimmermann (2000). This approach splits the data sets used to adjust the model (calibration data), whereas the other (evaluation data) is used to evaluate the quality of model predictions from the splitting of what was originally a single data set. 3.2 Biophysical explanatory variables

4

ACCEPTED MANUSCRIPT Twelve biophysical explanatory variables with 20 m spatial resolution that covers 19 100 ha area were used in the modelling. Four of these explanatory variables were climatic variables, four remote sensing variables, and four landscape variables. 3.2.1 Climatic variables

AC

CE

PT

ED

M

AN

US

CR

IP

T

Precipitation (PT), minimum temperature (Tmin), and maximum temperature (Tmax) data that were acquired from WorldClim data (Hijmans et al., 2005) are the climatic variables incorporated in the modelling (Fig. 2). These data were clipped, projected to Adindan / UTM zone 37N co-ordinate system and resampled to 20 m. The objective of the resampling process was to match the spatial resolution of the climate variable raster with the other input dataset for the model but not augment the spatial information in the climate datasets. Therefore, it is recognized that this method have a feature of uncertainties associated with a coarser spatial scale of the original climate datasets.

Fig 2. Climatic variables. 3.2.2 Remote sensing variables The spectral vegetation indices derived from satellite remote sensing data include: Normalized Difference Vegetation Index (NDVI), Simple Ratio (SR), Topographic Wetness Index (TWI) and Shadow Fraction (Fig. 3). NDVI was calculated using the red band (R) and the near infrared 5

ACCEPTED MANUSCRIPT band (NIR) of multispectral SPOT image, which was also pre-processed by adjusting for variables such as atmospheric correction and topographic correction (Hailu et al., 2014) in order to get a reliable products including SR, SR and TWI. The red band shows a high chlorophyll pigment absorptions and the NIR band shows the high reflectivity of plants ('equation 1'). NDVI = (NIR − R)⁄(NIR + R)

(1)

T

NDVI value ranges between -1 to 1. However, in the study area, the ranges of NDVI value was from -0.05 to 0.85 (Hailu et al., 2015). The values close to zero corresponding to pixels with dry / bare soil and values ≥ 0.5 represents to dense vegetation.

CR

IP

The SR, which is calculated by the ratio of the near infrared reflectance and red reflectance as shown in equation 2 (Xavier et al., 2004). The SR value in the study area ranges between 0.16 and 18.98 (Hailu et al., 2015).. (2)

AC

CE

PT

ED

M

AN

US

SR = NIR/R

Fig 3. Remote sensing variables. The Shadow Fraction (SF) was created in ENVI software using Sequential Maximum Angle Convex Cone (SMACC) method (Gruninger et al., 2004). Topographic Wetness Index 'equation (3)' was developed by Beven and Kirkby (1979) within the runoff model TOP-MODEL. 6

ACCEPTED MANUSCRIPT TWI = ln(𝒂/tan𝜷)

(3)

PT

ED

M

AN

US

CR

IP

T

where 𝒂 is the local upslope area draining through a certain point per unit contour length and tan β is the local slope.

Fig 4. Landscape variables.

CE

3.2.3 Landscape variables

AC

The landscape variables were distance to roads (DTRo), distance to river (DTRi), Digital Elevation Model (DEM) and slope (Fig. 4). The roads and rivers were traced using the SPOT 5 satellite image acquired in 2011 and transformed to raster using the distance function in ArcGIS to use them as variables for the modelling. A digital elevation model (DEM) was processed as follows (Hailu et al., 2014): i) 1:50 000 scale topographic paper map sheets were obtained from the Ethiopian Mapping Agency (EMA) and scanned to convert them into a digital format. ii) Scanned map sheets were georeferenced with 20 meter interval contour lines using the on-screen digitization method in the ArcGIS software. iii) a 20 m planimetric resolution raster DEM was interpolated for the study area. This DEM was in a UTM Zone 37 projection with Adindan / UTM zone 37N Spheroid. This DEM was also used in this study for the topographic correction and orthorectification of the SPOT 5 satellite imagery. Slope was calculated from the 20 m DEM contours, describes the steepest down-hill slope for a location on a surface. That is, the maximum rate of change in elevation over each cell and its eight neighbours. Therefore, the higher the 7

ACCEPTED MANUSCRIPT slope value, the steeper is the terrain and vice versa. The range of slope values in the area varied from 0–44 degrees. 4. Modelling steps Five processing and analysis steps were performed: 1) collecting presence/absence data for Coffea arabica; 2) land cover mapping; 3) factors importance analysis; 4) species distribution modelling; and 5) geospatial analysis.

T

4.1 Presence/absence data collecting for Coffea arabica

M

AN

US

CR

IP

Two methods were used to collect Coffea arabica presence data. First, coffee presence places were identified with field surveys. Second, aerial photographs were used to collect Coffea arabica presence data to cover the whole study area to avoid sampling bias. High resolution aerial photographs were acquired during a flight campaign in October 2012 using a NIKON D3X colour digital camera equipped with a 14 mm lens, which produces a 78° opening angle as was described in detail in Hailu et al., (2014). The camera is part of the EnsoMOSAIC aerial imaging system that consisting of flight-planning software, navigation software, triggering unit, a global positioning system (GPS) and a power source (MosaicMill, 2013). Single image frames were mosaicked using EnsoMOSAIC software and the resulting mosaics were orthorectified, projected to Adindan / UTM zone 37N and data were generated with 0.5 m ground resolution. The geometric accuracy was within 2 m as verified in the field using GPS data. 4.2 Land cover mapping

AC

CE

PT

ED

Land cover mapping was carried out using object based image analysis that was applied in eCognition 8.7 software with (i) image segmentation and (ii) classification to detect the indigenous forests in the area. The classification was done based on the SPOT-5 satellite image, aerial photographs and field observations. The SPOT-5 satellite image was acquired by the HRG2 sensor (path 134/row 334) on 17 December 2008). There are two modes of acquisition: the first is the panchromatic mode that is within the range of 0.48–0.71 μm wavelengths having 2.5 m spatial resolution. The second, is the multispectral mode with 10 m spatial resolution, which has four bands, namely: Green band: 0.50–0.59 μm, Red band; 0.61–0.68 μm, Near infrared band: 0.78–0.89 μm, and Short-wave infrared band: 1.58–1.75 μm. This image was preprocessed by correcting for atmosphere, orthorectification, and topographic normalization using the ERDAS Imagine®2011 software. The atmospheric correction used dark object subtraction (DOS3) method (Chavez 1996) and topographic correction was performed by c-correction method (Teillet et al., 1982). It was necessary first to orthorectify the imagery utilizing a 20 metre planimetric resolution DEM because of the rugged terrain in the study area. A complete description of the methodology is given in Hailu et al. (2014). The satellite image in this study was used for driving surface reflectance, vegetation indices and land cover based predictors. The classification mainly focused on indigenous forest mapping because understory coffee (Coffea arabica) shrubs grow under the indigenous forest either wild or as plantations in Ethiopia (Schmitt et al., 2009; Aerts et al., 2012). The indigenous forest area (11158 ha) were 8

ACCEPTED MANUSCRIPT

M

AN

US

CR

IP

T

used for this research in order to mask the other land covers in the study area for the modelling (Fig. 5).

4.3 Factor importance analysis

ED

Fig 5. Land cover map of the study area.

CE

PT

Factor importance analysis (FIA) was used for selecting variables that determine the presence or absence of the species in a given pixel. This entailed that the input explanatory variables being tested in order to be spatially independent. The spatial dependence between pairs of variables was tested using the Pearson's correlation coefficients (Table 1). Factors that were greater than |±0.50| were considered to be auto-correlated and excluded from the explanatory variables. DEM, ET, PT and SR were not used as an input variables for the modelling.

AC

Pearson's Correlation Coefficient (Whole layer)

Table 1. Factors Importance analysis based on Pearson's correlation coefficients ( gray color shows high correlation ) Pearson's Correlation Coefficient (area with species) DEM DTRi DTRo ET NDVI PT DEM 1 0.78 0.03 -0.98 0.1 0.95 DTRi 0.29 1 0.11 -0.79 0.09 0.83 DTRo -0.01 -0.26 1 -0.03 0.07 0.15 ET -0.98 -0.3 0.01 1 -0.14 -0.95 NDVI 0.41 -0.03 0.25 -0.41 1 0.14 PT 0.92 0.4 -0.14 -0.91 0.31 1 SF -0.11 0 -0.08 0.1 -0.43 -0.06 Slp -0.16 -0.09 -0.01 0.17 -0.02 -0.15 SR 0.38 -0.06 0.25 -0.38 0.68 0.27 Tmax -0.98 -0.27 -0.02 0.99 -0.42 -0.89 Tmin -0.98 -0.3 -0.01 0.99 -0.41 -0.93

SF 0.21 0.22 -0.08 -0.21 -0.42 0.22 1 -0.1 -0.49 0.1 0.1

Slp -0.15 -0.11 -0.22 0.1 0.12 -0.15 -0.12 1 0.07 0.16 0.16

SR 0.03 0.08 0.07 -0.09 0.56 0.06 -0.43 0.3 1 -0.39 -0.38

Tmax -0.97 -0.74 -0.02 0.99 -0.14 -0.93 -0.2 0.09 -0.1 1 0.01

Tmin -0.98 -0.78 -0.04 0.99 -0.15 -0.95 -0.21 0.12 -0.09 0.09 1

TWI -0.2 -0.19 -0.19 0.11 0.08 -0.21 0.11 0.06 0.01 0.09 0.11

9

ACCEPTED MANUSCRIPT TWI

-0.01

0.01

0.06

0.01

-0.05

-0.3

0.12

-0.21

-0.08

0.01

0.01

1

4.4 Species distribution modelling

US

CR

IP

T

Four predictive models were evaluated: GLM, ANN, MaxEnt and SVM. We used ModEco (Guo and Liu, 2010) as the platform for pseudo-absence based and background based models in this research. MaxEnt, SVM, GLM, and ANN were implemented as background-based models and pseudo-absence based models. The difference between the two is the background-based models one sample the “pseudo-absence data” is used to represent the whole study area, which results in certain types of conditional probability arising depending on the models used (Phillips and Dudı´k, 2008). Out of 112 random presence points that were collected during field survey and from 0.5m resolution aerial photographs, 84 points were used for the model training and 28 points were used for validating the models. A total of 168 absence points, which was twice the models' training data points were created randomly using background-based and pseudo-absence data in order to use all the models (Table 2 ).

PT

ED

M

AN

The Generalized Linear Model (GLM) uses a link function that transforms the scale of the dependent variable, thus enabling it to relax the distribution and constancy of variances assumptions that are commonly required for more standard linear models such as linear regression (McCullagh and Nelder, 1989). The GLM is commonly used to model dependent variables that are discrete distributions and are nonlinearly related to independent variables (Guisan et al., 2002). Consequently, the GLM model is particularly suitable for predicting species distributions, and has been proven to be successful in various ecological applications (Guisan et al., 2002). Three link functions are implemented in ModEco: logit link, log-log link, and complementary log-log link. In this study, logit function was used as a link function in the GLM model.

AC

CE

Artificial Neural Networks (ANN), according to Campbell (2002), simulate the human learning processes through establishment and reinforcement of linkages between the input and output data in the absence of training data. A number of ANN algorithms have been proposed, however, the current and the most used is an algorithm called the Multi-Layer Perceptron Neural Networks (MLP Neural Nets) with back-propagation (Özçelik et al., 2014). The architecture of the MLP Neural Nets consists of input, hidden, and output layers, each of which has a set of interconnected nodes (neurons) that work in parallel to transform the input data into output values (Lee and Evangelista, 2006; Conforti et al., 2014) as shown in Fig 6. ANNs have been commonly used to model complex relationships between the dependent variables, and independent variables. The idea of ANNs is to extract linear combinations of the input variable as derived features, and model the output as a nonlinear function of these derived features (Hastie et al., 2001). Moreover, ANNs have been successfully used to predict species distribution (Maravelias et al., 2003).

10

AN

US

CR

IP

T

ACCEPTED MANUSCRIPT

M

Fig 6. Multi-Layer Perceptron neural networks (MLP) with back-propagation architecture

AC

CE

PT

ED

MaxEnt measures how much choice is taken into consideration in the selection of an event and it is a fundamental concept in information theory (Shannon, 1948). The principle of maximum entropy specifies that the distribution model that fulfils any given constraints should be as reliable as possible (Phillips et al., 2004). This entails cautiously avoids assuming anything that has not been determined but agrees with everything that is known (Jaynes, 1990). Therefore the concept of MaxEnt is for recognizing the probability distribution of maximum entropy in order to approximate a target probability distribution. This is subject to a cluster of constraints that represent the incomplete information about the target distribution. According to Phillips et al. (2006), when MaxEnt is applied to presence-only species distribution modelling, i) the pixels of the study area constitute the space on which the MaxEnt probability distribution is defined, ii) pixels with known species occurrence records comprise the sample points, and iii) the features are climatic variables, elevation, vegetation type or other environmental variables with their associated functions. MaxEnt has been extensively used to map species distributions, especially when only presence data are available and when sample sizes are small (Elith et al., 2006). The Support Vector Machine (SVM) was originally developed by Vapnik (1995) and SVMs are considered to be a new generation of learning algorithms. SVMs distribution models have several useful characteristics: are statistically based models rather than loose analogies with natural learning systems and theoretically therefore guarantee performance (Cristianini and Scholkopf, 2002). According to Tso and Mather (2009) SVM is less sensitive to over-fitting of a

11

ACCEPTED MANUSCRIPT classifier during the training stage compared to the non-parametric artificial neural network (ANN) classification. MaxEnt Type: Prob-MaxEnt

T

Table 2. Parameter Settings of Species Distribution Models SVM ANN GLM Type: Type: Type: C-SVC BP-ANN Prob-GLM Kernel RBF* Training BP: Back Link Function LOGIT Gamma 0.5 method Propagation Threshold 0.5 Cost 1.000

CR

IP

Valid Presence points: 112, Presence Points: 84, Testing points: 28, Absence Points (randomly): 168 *Radial Basis Function

M

AN

US

The models were also tested for their predictive performance using True Positive Rate (TPR) or the Sensitivity method (Fielding and Bell, 1997) in the ModEco program that is based on the 28 validating points. TPR was calculated after implementing a confusion matrix for the four models that used pseudo-absence data. A confusion matrix includes information about predicted and actual classifications in a classification system (Kohavi and Provost, 1998). The performance of the classification was evaluated by the data in the confusion matrix as shown in Table 3 for a two class classifier (presence and absence). Accuracy (AC) is the number of predictions that were correct expressed as a percentage/proportion of the total. TPR is the proportion of presence cases that were correctly identified.

Absence Presence

Predicted Absence Presence A b C d

PT

Actual

ED

Table 3. Confusion matrix of the models, model performance

AC (a+d)/ (a+b+c+d)

Model Performance TPR d/(c+d) Kappa

AUC

CE

a and d are the numbers of correct predictions when an instance is negative and positive, respectively. b and c are the number of incorrect predictions when an instance is positive and negative, respectively.

AC

4.5 Geospatial analysis

The spatial extent of the predicted presence of understory coffee was determined for the result of each model in the study area. In addition to this, a geographical overlay analysis was executed in order to estimate the spatial extent of understory coffee present for all models. Geographic Overlay Analysis (GOA) used to measure how the SDMs interrelated for predicting the presence of understory coffee in the area. This was carried out by combining the model results found for each SDM using the combine tool in the ArcGIS 10.2 software. 5. Results 5.1 Understory Coffee (Coffea arabica) distribution Modelling

12

ACCEPTED MANUSCRIPT

T

Results of the four modelling methods with pseudo absence and background absence are shown in Fig. 7. The Gray color represent indigenous forest, whereas the red represents indigenous forest and predicted presence of understory coffee. The results show that MaxEnt with pseudo absence data and SVM with background absence have the highest area of understory coffee presence prediction with 12.2% and 23.1% area coverage of an indigenous forest, respectively (Table 4). On the other hand, the lowest understory coffee presence prediction with 8.0% and 6.4% of indigenous forest area was detected by ANN with pseudo absence and background absence data, respectively.

US

CR

IP

Table 4. Geospatial extent of Coffea arabica L. in the study area using different SDMs. Presence area (ha) Absence Area (ha) Presence Area Percentage (%) GLM 1158 10000 10.4 ANN 888 10270 8.0 MaxEnt 1361 9797 12.2 SVM 1174 9984 10.5 B-GLM 719 10439 6.4 B-ANN 1385 9773 12.4 B-MaxEnt 1399 9759 12.5 B-SVM 2581 8577 23.1

M

AN

Fig. 8 and Fig. 9 show that the geographical overlay analysis of all the four model resulted in both pseudo and background absence taking the predicted presence of understory coffee. The spatial extents of predicted presence of understory coffee after GOA occupies 1018 ha and 519 ha of the study area for background and pseudo absence data, respectively.

ANN

Outcome

MaxEnt

Outcome

SVM

Outcome

1 0 1 0 1 0 1 0

1 69 15 46 38 68 16 69 15

PT

Outcome

0 23 145 5 163 17 151 24 144

AC

TPR

AUC

Kappa

0.849

0.821

0.852

0.669

0.829

0.548

0.953

0.574

0.869

0.810

0.874

0.706

0.845

0.821

0.835

0.661

AC

CE

GLM

ED

Table 5. Confusion matrix for the four models and their performance measurements Model Performance Reference

5.2 Model Performance The confusion matrices of GLM, ANN, MaxEnt and SVM are shown in Table 5. According to these matrices the model performance test for the 28 test points show that the true positive rates of GLM, ANN, MaxEnt and SVM were 0.821, 0.548, 0.810 and 0.821, respectively. That is, both GLM and SVM models performed better (TPR=0.821) and ANN model showed the lowest performance (TPR=0.548). On the other hand, the performance for TPR of GLM, ANN, MaxEnt and SVM models, which used background absence were 0.821, 0.679, 0.964 and 0.964, respectively. This result shows that MaxEnt and SVM were the two most robust modelling methods (TPR=0.964 for both types) and ANN had the lowest (TPR=0.679). 13

AC

CE

PT

ED

M

AN

US

CR

IP

T

ACCEPTED MANUSCRIPT

Fig. 7. Model results based on pseudo absence basis of GLM, ANN, MaxEnt, SVM and background basis of GLM (B-GLM), ANN (B-ANN), MaxEnt (B-MaxEnt), SVM (B-SVM).

14

AN

US

CR

IP

T

ACCEPTED MANUSCRIPT

AC

CE

PT

ED

M

Fig. 8. Geographical overlay Analysis map of SDMs with the background absence data.

Fig. 9. Geographical overlay Analysis map of SDMs with the pseudo absence data. 15

ACCEPTED MANUSCRIPT 6. Discussion

US

CR

IP

T

Coffea arabica is one of the approximately 100 species that belongs to the Coffea genus. This species along with with Coffea canephora (robusta coffee) accounts for 99% of the world coffee trade, and 70% of the coffee consumed in the world (Damatta and Ramalho, 2006). Although Coffee arabica originated from the southwest highlands of Ethiopia (Davis et al., 2012) and the country's economy is highly dependent on coffee (IMF, 2006), less attention has been given to the spatial distribution of coffee growing naturally or cultivated in Ethiopia. We show in this research that the spatial distribution, specifically the area coverage of Coffea arabica in a major coffee growing area of the south western highlands of Ethiopia using the four SDM models (GLM, ANN, MaxEnt and SVM). As shown in Table 2, 12.2% of the study area was covered by Coffea arabica species as determined by the MaxEnt model. This was the only modelling method used for estimating the distribution of Coffea arabica on a countrywide level in Ethiopia by Davis et al. (2012) in order to predicted the present and future distribution of indigenous Arabica coffee. The GLM, ANN and SVM have hitherto never been used for Coffea arabica species distribution in the country.

CE

PT

ED

M

AN

Mapping understory coffee is difficult because Coffea arabica growing and production takes place under indigenous trees (Hailu et al., 2015). Hailu et al. (2015) mapped the probabilistic presence of understory coffee in the Southwest highlands using remote sensing and climatic variables. In order to identify the indigenous forest in the area, Object Based Image Analysis (OBIA) was used in eCognition using high resolution satellite images, aerial photographs and field observations due to the presence of exotic trees planted (Hailu et al., 2014). The results of this present study build upon the previous studies of Hailu et al. (2014 and 2015), as we estimated the geographical extent of these shrubs in the area using the four species distribution models (GLM, ANN, MaxEnt and SVM). Though the outcome maps from these models look comparable, some small, but important, differences occur which are described in Table 3 as the geospatial extent for the presence and absence of Coffea arabica.

AC

A species distribution model is projected into geographic space, which gives a result on a geospatial extent of the predicted presence for the species as shown in Table 3 and it indicates the suitability for that species in ecological space. The results in Fig. 7 implies that the outputs of each model are spatially different in the same survey data of the species and associated explanatory variables. Different methods have different strengths and weaknesses and the choice of the appropriate method depends on the data, assumptions and goals of the exercise (Segurado and Araújo, 2004). However, in this research, different methods used the same data and goal have been taken in to consideration and compared. Consequently, GLM and SVM show the highest performance when using Background absence and SVM and MaxEnt models have high performance when using pseudo absence data. SVM in both cases of absence data, was the most robust method for predicting the presence of understory coffee in the area. These models show their relative suitability for understory plant species distributions such as Coffea arabica, which is grows under the canopies of indigenous trees in Ethiopia. 16

ACCEPTED MANUSCRIPT

IP

T

In ModEco, there were four options available to select pseudo or background absence data automatically, namely: (i) random, (ii) buffer area of presence points by specifying distance, iii) from an existing classification result map, and (iv) multi-dimensional envelop using expanded percentage. We chose the random selection method from these four methods to get absence data because it is based on the density estimate of known locations. This method was also recommended for all SDMs by Barbet-Massin et al. (2012). Moreover, this research identifies that the same number of background absence and pseudo absence data give different results when using the same SDMs and same number of presence data. For example, the predicted presence area of understory coffee using GLM with pseudo absence (1158 ha) was substantially different from background absence data (719ha).

PT

ED

M

AN

US

CR

Combining approaches of dynamic and static variables in SDMs predictions remain poorly understood and controversial (Brook et al., 2009). Most studies have used only bioclimatic data as explanatory variables to predict a species distribution (Pearson and Thuiller 2006; Hole et al., 2009; Carvalho et al., 2010). Non-climatic variables, such as elevation, slope and aspect alongside precipitation and temperature, were also incorporated in some studies (Peterson et al. 2002). We show in the present study that local scale mapping, which includes landscape variables (distance to roads, distance to river, elevation, slope) and remote sensing variables (NDVI, SR, TWI and Shadow Fraction), can give adequate determinations of the presence of a species. Moreover, this can be achieved without using bioclimatic variables or other important variables. This research will be continued in the future by investigating the distribution of Coffea arabica over larger areas. Such research will also consider the environmental variables identified by this research and use the most robust species distribution model. A limitation of the current research was that the result would be improved if the soil data had been incorporated into the modelling but such data were not available for our study area as it has low resolution.

AC

CE

The predicted geospatial extent of understory coffee presence using the four SDMs as shown in Fig 7 and Table 3 was based on the current climatic conditions. That is, we did not cover the future conditions of the presence of understory coffee in the area as it is beyond the scope this study. However, we recommend considering these climate data sets in further studies in order to predict the future geospatial extent and the existence of understory coffee in the area and for the whole country. 7. Conclusion

The extent and distribution of understory coffee (Coffea arabica) in the southwestern highlands of Ethiopia, are currently unknown, which prevents the optimal management of coffee production. We modeled the distribution of Coffea arabica using four SDMs. The highest area of understory coffee presence prediction was 12.2% using the MaxEnt model with pseudo absence data, and a 23.1% area coverage of indigenous forest using the SVM with background absence. The geographic overlay analysis result of the four models shows the spatial extent of predicted presence of understory coffee occupies 1018 ha and 519 ha of the study area when using background and pseudo absence data, respectively. The SDMs (GLM, ANN, MAxEnt and SVM) 17

ACCEPTED MANUSCRIPT were tested for one species, the same data (presence/absence and explanatory variables). Among the tested methods, SVM gave the best results (TPR=0.821 for pseudo absence and TPR=0.964 background absence data). Acknowledgements

CR

IP

T

The authors would like to thank the Climate Change Impacts on Ecosystem Services and Food Security in Eastern Africa (CHIESA) project, which is funded by the Ministry of Foreign Affairs of Finland. Dr Mika Siljander is currently funded by the Academy of Finland through The Adaptation for Ecosystem Resilience in Africa (AFERIA) project and Dr Eduardo Maeda is currently funded by the Academy of Finland. The authors are grateful to the school of Earth Sciences, Addis Ababa University for the facilities. References

AN

US

Aerts, R., Berecha G., Gijbels P., Hundera K., Van Glaberke S., Van-Depitte K., Muys B., Roldán-Ruiz I., Honnay O., 2012. Genetic variation and risks of introgression in the wild Coffea arabica gene pool in south-western Ethiopian montane rainforests. Evolutionary Applications 6(2), 243–252. doi:10.1111/j.1752–4571.2012.00285.x.

M

Austin, M.P., 2005. Vegetation and environment: discontinuities and continuities. In: van der Maarel, E. (Ed.). Vegetation Ecology. Blackwells Science Ltd., Oxford. 52–84.

ED

Austin, M. P., Smith, T. M., 1989. A new model for the continuum concept. Vegetatio 83, 35– 47.

PT

Barbet-Massin M., Jiguet, F., Albert C. H., Thuiller, W., 2012. Selecting pseudo-absences for species distribution models: how, where and how many? Methods in Ecology and Evolution British Ecological Society. doi: 10.1111/j.2041-210X.2011.00172.x

CE

Beven, K. J. & Kirkby, M. J., 1979. A physically based, variable contributing area model of basin hydrology. Hydrolological Sciences Bulletin 24, 43–69.

AC

Brook, B.W., Akçakaya, H.R., Keith, D.A., Mace, G.M., Pearson, R.G., Araújo, M.B., 2009. Integrating bioclimate with population models to improve forecasts of species extinctions under climate change. Biology Letters 5, 723–725. Campbell, J. B., 2002. Introduction to Remote Sensing. Taylor & Francis, London. Carvalho, S.B., Brito, J.C., Crespo, E.J., Possingham, H.P., 2010. From climate change predictions to actions conserving vulnerable animal groups in hotspots at a regional scale. Global Change Biology 16, 3257–3270. Conforti, M., Pascale, S., Robustelli, G., & Sdao, F., 2014. Evaluation of prediction capability of the artificial neural networks for mapping landslide susceptibility in the Turbolo River catchment, Northern Ca;abria, Italy. Catena 113, 236–250. 18

ACCEPTED MANUSCRIPT Cristianini, N., Scholkopf, B., 2002. Support vector machines and kernel methods - the new generation of learning machines. Ai Magazine 23, 31–41. Damatta, F. M., Ramalho, J. D. C., 2006. Impacts of drought and temperature stress on coffee physiology and production: a review. Brazilian Journal of Plant Physiology 18, 55–81.

T

Davis, A. P., Gole , T. W., Baena, S., Moat, J., 2012. The impact of climate change on indigenous Arabica Coffee (Coffea arabica): Predicting future trends and identifying priorities. PLos ONE 7(11), e47981. DOI:10.1371/journal.pone.0047981.

CR

IP

Elith, J., Leathwick J., 2009. Species distribution models: ecological explanation and prediction across space and time. Annual Review Ecological System 40, 677–697. doi: 10.1146/annurev.ecolsys.110308.120159.

US

Elith, J., Graham, C.H., Anderson, R.P., Dudik, M., Ferries, S., Guisan, A., et al., 2006. Novel methods improve prediction of species' distributions from occurrence data. Ecography 29, 129–151.

AN

Fielding, A.H., Bell J.F., 1997. A review of methods for the measurement of prediction errors in conservation presence/absence models. Environmental Conservation 24, 38–49.

M

Gil, P.R., Mittermeier, R.A., Hoffmann, M., Pilgrim, J., Goettsch-Mittermeier, C., Lamoreux, J., et al. (2004). Hotspots revisited. CEMEX, Mexico City.

PT

ED

Gole, T. W., Borsch, T., Denich, M., Teketay, D., 2008. Floristic composition and environmental factors characterizing coffee forests in southwest Ethiopia. Forest Ecology and Management 255, 2138–2150.

CE

Gruninger, J., Ratkowski A.J., Hoke, M.L., 2004. The sequential maximum angle convex cone (SMACC) endmember model. SPIE Proceeding, 5425-1.

AC

Guisan, A., Zimmermann N.E., 2000. Predictive habitat distribution models in ecology. Ecological Modelling 135, 147–186. Guisan, A., Edwards, T.C., Hastie, T., 2002. Generalized linear and generalized additive models in studies of species distributions: Setting the scene. Ecological Modelling 157, 89–100. Guo, Q., Liu, Y., 2010. ModEco: an integrated software package for ecological niche modelling. Ecography 33, 1–6. doi: 10.1111/j.1600-0587.2010.06416.x Hailu, B. T., Maeda, E. E., Hurskainen, P., Pellikka, P. P. K. E., 2014. Object-Based Image Analysis for Distinguishing Indigenous and Exotic Forests in Coffee Production Areas of Ethiopia. Applied Geomatics 6, 207–214. doi:10.1007/s12518-014-0136-x.

19

ACCEPTED MANUSCRIPT Hailu B. T., Maeda, E.E., Pellikka P., Pfeifer M., 2015. Identifying potential areas of understorey coffee in Ethiopia’s highlands using predictive modelling. International Journal of Remote Sensing 36(11), 2898-2919. Hastie, T., Tibshirani, R, Friedman, J., 2001. The elements of statistical learning: Data mining, inference and prediction., New York: Springer.

IP

T

Hernandez-Martinez, G., Manson, R.H., Hernandez, A.C., 2009. Quantitative classification of coffee agroecosystems spanning a range of production intensities in centeral veracruz, Mexico. Agriculture. Ecosystems & Environment 134, 89–98. http://dx.doi.org/10.1016/j.agee.2009.05.020.

US

CR

Hijmans, R.J., Cameron, S.E., Parra, J.L., Jones, P.G., Jarvis, S.A., 2005. Very high resolution interpolated climate surfaces for global land areas. International Journal of Climatology 25, 1965–1978.

AN

Hole,D.G., Willis, S.G., Pain, D.J., Fishpool, L.D., Butchart, S.H.M., & Collingham, Y.C., et al. 2009. Projected impacts of climate change on a continent-wide protected area network. Ecology Letters 12, 420–431.

M

International Coffee Organization (ICO), 2015. Trade Statistics. Available: http://www.ico.org/trade_statistics.asp?section=Statistics. Accessed 2015 Aug 13

ED

Jaynes, E.T., 1990. Notes on present status and future prospects. In: Grandy, W.T. and Schick, L.H. (eds.). Maximum Entropy and Bayesian Methods. Dordrecht, The Netherlands: Kluwer.

CE

PT

Kanellopoulos I., Wilkinson G. G., 1997. Strategies and best practice for neural network image classification, International Journal of Remote Sensing, 18:4, 711-725, DOI: 10.1080/014311697218719

AC

Kohavi, R., Provost, F., 1998. On Applied Research in Machine Learning. In Editorial for the Special Issue on Applications of Machine Learning and the Knowledge Discovery Process, Columbia University, New York , volume 30. Lee, S., Evangelista, D. G., 2006. Earthquake-induced landslide susceptibility mapping using an artificial neural network. Natural Hazards Earth System Science 6, 687–695. Maravelias, C. D., Haralabous, J. and Papaconstantinou, C., 2003. Predicting demersal fish species distributions in the mediterranean sea using artificial neural networks. Marine Ecology-Progress Series 255, 249–258. McCullagh, P. Nelder, J.A., 1989. Generalized Linear Models. 2nd ed. Chapman and Hall, New York. pp511. 20

ACCEPTED MANUSCRIPT McDonald, A. J., Gemmell, F. M., Lewis, P. E.,1998. Investigation of the utility of spectral vegetation indices for determining information on coniferous forests. Remote Sensing of Environment 66, 250–272. Pearson, R.G., Thuiller, W., Arau´ jo, M.B., Martinez-Meyer, E., Brotons, L., McClean, C., et al., 2006. Model based uncertainity in species range prediction. Journal of Biogeography 33, 1704–1711.

IP

T

Peterson, A.T., Ortega-Huerta, M.A., Bartley, J., Sanchez-Cordero, V., Soberon, J., Buddemeier, R.H., et al., 2002. Future projections for Mexican faunas under global climate change scenarios. Nature 416, 626–629.

CR

Phillips, S. Dudı´k, M., 2008. Modeling of species distributions with Maxent: new extensions and a comprehensive evaluation. Ecography 31, 161–175.

AN

US

Phillips, S., Dudik, M., Schapire, R.E., 2004. A maximum entropy approach to species distribution modelling. In: Proceedings of the twenty first international conference on machine learning, 2004 Banff, Alberta, Canada. 1015412: ACM, pp. 655–662.

M

Phillips, S.J., Anderson R.P., Schapire, R.E., 2006. Maximum entropy modelling of species geographic distribution. Ecological Modelling 190, 231–259.

ED

Ponte, S., 2002. The ‘latte revolution’? Regulation, markets and consumption in the global coffee chain. World Development. 30 1099–1122.

PT

Rushton, S.P., Ormerod, S.J., Kerby, G., 2004. New paradigms for modelling species distribution?. Journal of Applied Ecology 41, 193–200.

CE

Schmitt, C. B., Senbeta, F., Denich, M., Peisinger, H., Boehmer, H. J., 2009. Wild coffee management and plant diversity in the montane rainforest of southwestern Ethiopia. African Journal of Ecology 48, 78–86.

AC

Segurado, P., Araújo , M. B., 2004. An evaluation of methods for modelling species distributions. Journal of Biogeography 31, 1555–1568. Shatz, A. J., Rogan, J., Sangermano, F., Ogneva-Himmelberger, Y., & Chen, H., 2013. Characterizing the potential distribution of the invasive Asian longhorned beetle (Anoplophora glabripennis) inWorcester County, Massachusetts. Applied Geography 45, 259–268. Shannon, C. E.,1948. A mathematical theory of communication. The Bell system technical Journal 27, 379–423. Teketay, D., 1999. History, botany and ecological requirements of Coffee. Journal Ethiopian Wildlife and Natural History Society 20, 28–50. 21

ACCEPTED MANUSCRIPT Tso, B. , MatherP. M., 2009. Remote Sensing in the Optical and Microwave Regions, Classification Methods for Remotely Sensed Data. Second Edition. CRC Press. Vapnik, V., 1995. The nature of statistical learning theory, New York: Springer-Verlag. Van der Vossen, H.A.M., 1985. Coffee selection and breeding. In coffee: Botany, Biochemistry and production of beans and beverage. Clifford, M. N. and Willson, K. C. (Ed.), London: Croom Helm, 48–96.

IP

T

Vojislav, K., 2001. Learning and Soft Computing: Support Vector Machines, Neural Networks, and Fuzzy Logic Models (Complex Adaptive Systems).The IMT Press.

US

CR

Von Humboldt, A., Bonpland, A., 1805. Essai sur la géographie des plantes: accompagné d'un tableau physique des régions équinoxiales, fondé sur des mesures exécutées, depuis le dixième degré de latitude boréale jusqu'au dixième degré de latitude australe, pendant les années 1799, 1800, 1801, 1802 et 1803. Chez Levrault, Schoell et compagnie, libraires, XIII–1805. Paris.

M

AN

Warren D. L., Seifert S. N., 2011. Ecological niche modeling in MAXENT: the importance of model complexity and the performance of model selection criteria. Ecological Applications 21, 335–342.

AC

CE

PT

ED

Xavier, A. C., Vettorazzi, C.A., 2004. Mapping leaf area index through spectral vegetation indices in a subtropical watershed. International Journal of Remote Sensing 25(9), 1661– 1672.

22

ACCEPTED MANUSCRIPT Highlights 1. The extent of understory coffee (Coffea arabica L.) were mapped using geospatial data analysis, very high resolution satellite images and aerial photographs. 2. Integrated approaches of Species Distribution Models and geospatial analysis tools were implemented..

T

3. MaxEnt with pseudo absence data show 12.2% and Supper Vector Model (SVM) with

IP

background absence showed 23.1% area coverage of indigenous forest, respectively.

CR

4. Model performance test using TPR indicate that Generalized Linear Model and SVM (with 0.821 using pseudo absence data) and MaxEnt and SVM both were equally the

AC

CE

PT

ED

M

AN

US

most robust modelling methods (TPR=0.964 using background absence data).

23