sdm

 

getting started with sdm package

Babak Naimi, Miguel B. Araujo

3 April 2016 (update: 17 Oct. 2016)

sdm is an object-oriented, reproducible and extensible R platform for species distribution modelling. The sdm package is designed to create a comprehensive modelling and simulation framework that: 1) provides a standardised and unified structure for handling species distributions data and modelling techniques (e.g. a unified interface is used to fit different models offered by different packages); 2) is able to support markedly different modelling approaches; 3) enables scientists to modify the existing methods, extend the framework by developing new methods or procedures, and share them to be reproduced by the other scientists; 4) handles spatial as well as temporal data for single or multiple species; 5) employs high performance computing solutions to speed up modelling and simulations, and finally; 6) uses flexible and easy-to-use GUI interface. For more information, check the published paper by Naimi and Araujo (2016) in the journal of Ecography.

This document provides a very quick demonstration on the package followed by some examples, that would be helpful to get a guick start with the package.

Installing sdm and all the required packages

sdm can simply be installed using the standard install.packages function as:

install.packages(‘sdm’)

Depending on the methods are selected through the modelling and using the package, several packages may be needed, and therefore, should be installed on your machine. A quick way to install all the required packages (to guarantee having full functionaliy of sdm), is to simply use the function installAll offered by the sdm package. You can simply call it without any argument:

installAll()

A brief overview:

There are three main functions provide the main steps of developing/using species distibution models. The three steps include data preparation, model fitting/ evaluation, and prediction. The functions used for these steps:

  • sdmData: to read data from different formats, and prepare a data object. Both species (single or multiple) and explanatory variables can be introduced to the function, as well as other information such as spatial coordinates, time, grouping variables, etc.
  • sdm: to fit and evaluate species distribution models (multiple algorithms can be used)
  • predict: when models are fitted, they can be used to predict/project given a new dataset.

  • ensemble: when models are fitted, they can be used to predict/project given a new dataset and combined into a final prediction/projection.

Example Dataset:

The package is followed by several datasets that are used in the help pages. We also use one of those examples here for our demonstration:

We use a shapefile containing presence=absence records for a species as spatial points (species.shp), and four raster datasets (in Ascii format) as explanatory variables (predictors). The files are in the sdm library, so we can directly read them from the library folder. We use rgdal to read raster data (it supports different common formats), and raster package to handle the raster data.

library(sdm)

library(raster)
library(rgdal)

file <- system.file("external/species.shp", package="sdm") # get the location of the species shapefile

# so, file is simply a filename (with path):
file
# read the species shapefile using the function shapefile:

species <- shapefile(file)

class(species) # it is a SpatialPointsDataFrame
## [1] "SpatialPointsDataFrame"
## attr(,"package")
## [1] "sp"
plot(species)

unnamed-chunk-2-1

# we can take a look at the head of attribute table in the species dataset:

head(species)
##   Occurrence
## 1          1
## 2          0
## 3          1
## 4          1
## 5          1
## 6          0
# you can see that there is a column containing the presence-absence records (i.e., Occurrence)

#--- we can plot presence and absence points separately with different colours:

plot(species[species$Occurrence == 1,],col='blue',pch=16)

points(species[species$Occurrence == 0,],col='red',pch=16)

unnamed-chunk-2-2

##########
# Let's read predictor variables (raster datasets)
# We have four Ascii-Grids, so, let's just take the name of all files ending to '.asc' to 
# be able to read them together. list.files function can be used to get a list of files in a given path:

path <- system.file("external", package="sdm") # path to the folder contains the data
# list the name of files in the specified path, match the pattern # (means all files with a name ending to asc). 
# We asked to generate full names (i.e., names with the path)
lst <- list.files(path=path,pattern='asc$',full.names = T) 

lst # this is the name of raster files we want to use as predictor variables
## [1] "/Library/3.3/Resources/library/sdm/external/elevation.asc"    
## [2] "/Library/3.3/Resources/library/sdm/external/precipitation.asc"
## [3] "/Library/3.3/Resources/library/sdm/external/temperature.asc"  
## [4] "/Library/3.3/Resources/library/sdm/external/vegetation.asc"
# stack is a function in the raster package, to read/create a multi-layers raster dataset
preds <- stack(lst) # making a raster object

preds # see the specification of the raster layers (e.g., cell size, extent, etc.)
## class       : RasterStack 
## dimensions  : 71, 124, 8804, 4  (nrow, ncol, ncell, nlayers)
## resolution  : 4219.223, 4219.223  (x, y)
## extent      : 100975.3, 624159, 3988830, 4288395  (xmin, xmax, ymin, ymax)
## coord. ref. : NA 
## names       : elevation, precipitation, temperature, vegetation
plot(preds)

unnamed-chunk-2-3

plot(preds[[4]]) # only plot the 4th layer

plot(species,add=T) # let's add the species point on the previous plot

unnamed-chunk-2-4

Data preparation

So far, we used other packages to just read the data we need to use in our study. Now, we can use the sdm package. First, we need to read and put data in the package that creates a data object. This is very simple and efficient using the function .

In this function, we can specify the train dataset (can be spatial points or simply a data.frame), and predictors (if available separately as a raster object). In addition, if there is an independent dataset available to be used for measuring model performance (evaluation/validation), we can provide it through the test arguement. A formula can also be used to specify the response and explanatory variables (and in case if you have data.frame as input which contains coordinates, you can specify the coordinates in the formula as well, + more information). If the formula is not specified, the function tries to detect the species and predictor variables.

library(sdm)

d <- sdmData(formula=Occurrence~., train=species, predictors=preds)

d
## class                                 : sdmdata 
## =========================================================== 
## number of species                     :  1 
## species names                         :  Occurrence 
## number of features                    :  4 
## feature names                         :  elevation, precipitation, temperature, ... 
## type                                  :  Presence-Absence 
## has independet test data?             :  FALSE 
## number of records                     :  200 
## has Coordinates?                      :  TRUE
# we didn't really need the formula in this example, as it would be easy for the function to guess which 
# dataset is species, and which are predictors. So, it could be like this:
d <- sdmData(train=species, predictors=preds)

d
## class                                 : sdmdata 
## =========================================================== 
## number of species                     :  1 
## species names                         :  Occurrence 
## number of features                    :  4 
## feature names                         :  elevation, precipitation, temperature, ... 
## type                                  :  Presence-Absence 
## has independet test data?             :  FALSE 
## number of records                     :  200 
## has Coordinates?                      :  TRUE
# However, formula makes it so flexible to handle the variables, specifally if there are several other 
# information (e.g., time). If you have multiple species, you can have their names in the left hand side
# (e.g., sp1+sp2+sp3~.)

# You may also want to take part of variables:
d <- sdmData(formula=Occurrence~precipitation+temperature, train=species, 
                                  predictors=preds)

d
## class                                 : sdmdata 
## =========================================================== 
## number of species                     :  1 
## species names                         :  Occurrence 
## number of features                    :  2 
## feature names                         :  precipitation, temperature 
## type                                  :  Presence-Absence 
## has independet test data?             :  FALSE 
## number of records                     :  200 
## has Coordinates?                      :  TRUE
d <- sdmData(formula= ~., train=species, predictors=preds)


#---

Model Fitting and Evaluation

When you create the data object (d in the above example), you would be able to fit the models. To do so, you are going to use the function . In this function, you can specify the variables and the type of features can be generated based on, through a formula. In addition, the name of methods can be specifies as well as settings.

# in the following example, we use 3 different methods to fit the models.

m1 <- sdm(Occurrence~.,data=d,methods=c('glm','gam','brt'))

m1
## class                                 : sdmModels 
## ======================================================== 
## number of species                     :  1 
## number of modelling methods           :  3 
## names of modelling methods            :  glm, gam, brt 
## ------------------------------------------
## model run success percentage (per species)  :
## ------------------------------------------
## method          Occurrence       
## ---------------------- 
## glm        :        100   %
## gam        :        100   %
## brt        :        100   %
## 
## ###################################################################
## model performance (per species), using training test dataset:
## --------------------------------------------------------------------
## 
##  ## species   :  Occurrence 
## ======================
## 
## methods    :     AUC     |     COR     |     TSS     |     Deviance 
## -------------------------------------------------------------------
## glm        :     0.88    |     0.7     |     0.69    |     0.83     
## gam        :     0.88    |     0.71    |     0.7     |     0.82     
## brt        :     0.89    |     0.72    |     0.7     |     0.92
# as you can see, a report is generates shows how many percent of models were successful, and 
# their performance
#-------------------

# in the above example, the performance statistics were calculated based on the training dataset 
#(the data that were used to fit the mdoel). It is a better idea to have an independent dataset 
#(if so, we would specify in the test argument of sdmData). However, for most of cases, there is no such 
# data available, therefore, we can split the dataset as an alternative solution. Splitting (partitioning) can 
# be one time or several times (several replicates). There are also several methods to do that 
# (i.e., subsampling, cross-validation, bootsrapping)

# Here we are going to fit 5 models and evaluate them through 2 runs of subsampling, each draw 30 percent
# of training data as test dataset:

m2 <- sdm(Occurrence~.,data=d,methods=c('rf','tree','fda','mars','svm'),replicatin='sub',
                          test.percent=30,n=2)

m2
## class                                 : sdmModels 
## ======================================================== 
## number of species                     :  1 
## number of modelling methods           :  5 
## names of modelling methods            :  rf, cart, fda, mars, svm 
## replicate.methods (data partitioning) :  subsampling 
## number of replicates (each method)    :  2 
## toral number of replicates per model  :  2 (per species) 
## test percentage (in subsampling)      :  30 
## ------------------------------------------
## model run success percentage (per species)  :
## ------------------------------------------
## method          Occurrence       
## ---------------------- 
## rf         :        100   %
## cart       :        100   %
## fda        :        100   %
## mars       :        100   %
## svm        :        100   %
## 
## ###################################################################
## model Mean performance (per species), using test dataset (generated using partitioning):
## --------------------------------------------------------------------
## 
##  ## species   :  Occurrence 
## ======================
## 
## methods    :     AUC     |     COR     |     TSS     |     Deviance 
## -------------------------------------------------------------------
## rf         :     0.88    |     0.7     |     0.73    |     0.87     
## cart       :     0.82    |     0.57    |     0.62    |     1.6      
## fda        :     0.9     |     0.73    |     0.72    |     0.77     
## mars       :     0.87    |     0.69    |     0.72    |     1.26     
## svm        :     0.88    |     0.75    |     0.8     |     1.13
getModelInfo(m2) # info on runs including modelID, whether they are successfully fitted and evaluated, etc.
##    modelID    species method replication replicationID success training
## 1        1 Occurrence     rf subsampling             1    TRUE     TRUE
## 2        2 Occurrence     rf subsampling             2    TRUE     TRUE
## 3        3 Occurrence   cart subsampling             1    TRUE     TRUE
## 4        4 Occurrence   cart subsampling             2    TRUE     TRUE
## 5        5 Occurrence    fda subsampling             1    TRUE     TRUE
## 6        6 Occurrence    fda subsampling             2    TRUE     TRUE
## 7        7 Occurrence   mars subsampling             1    TRUE     TRUE
## 8        8 Occurrence   mars subsampling             2    TRUE     TRUE
## 9        9 Occurrence    svm subsampling             1    TRUE     TRUE
## 10      10 Occurrence    svm subsampling             2    TRUE     TRUE
##    test.dep test.indep
## 1      TRUE      FALSE
## 2      TRUE      FALSE
## 3      TRUE      FALSE
## 4      TRUE      FALSE
## 5      TRUE      FALSE
## 6      TRUE      FALSE
## 7      TRUE      FALSE
## 8      TRUE      FALSE
## 9      TRUE      FALSE
## 10     TRUE      FALSE
# We can generate the roc curve and compare the results for all models:
roc(m2)

unnamed-chunk-4-1

# the plots can be smoothes:
roc(m2,smooth=T)

unnamed-chunk-4-2

Prediction

We can use the output of fitting, to predict into the study area, or project into a new location or a new time.

The predict function can be used for this purpose:

# in the following, we just predict the habitat suitability into the whole study area
# since the newdata is a raster object, the output is also a raster object

p1 <- predict(m1,newdata=preds,filename='p1.img') 
# many commonly used raster format is supported (through the package rgdal)

plot(p1)

unnamed-chunk-5-1

p2 <- predict(m2,newdata=preds,filename='p2.img')

p2
## class       : RasterBrick 
## dimensions  : 71, 124, 8804, 10  (nrow, ncol, ncell, nlayers)
## resolution  : 4219.223, 4219.223  (x, y)
## extent      : 100975.3, 624159, 3988830, 4288395  (xmin, xmax, ymin, ymax)
## coord. ref. : NA 
## data source : /Users/bnaimi/Dropbox/R_Books_Docs/r-gis.net/_sdm_rforge/sdm/pkg/sdm/vignettes/p2.img 
## names       : id_1.sp_1.m_rf.re_subs, id_2.sp_1.m_rf.re_subs, id_3.sp_1.m_cart.re_subs, id_4.sp_1.m_cart.re_subs, id_5.sp_1.m_fda.re_subs, id_6.sp_1.m_fda.re_subs, id_7.sp_1.m_mars.re_subs, id_8.sp_1.m_mars.re_subs, id_9.sp_1.m_svm.re_subs, id_10.sp_1.m_svm.re_subs 
## fullname    : id_1-species_Occurrence-method_rf-replication_subsampling, id_2-species_Occurrence-method_rf-replication_subsampling, id_3-species_Occurrence-method_cart-replication_subsampling, id_4-species_Occurrence-method_cart-replication_subsampling, id_5-species_Occurrence-method_fda-replication_subsampling, id_6-species_Occurrence-method_fda-replication_subsampling, id_7-species_Occurrence-method_mars-replication_subsampling, id_8-species_Occurrence-method_mars-replication_subsampling, id_9-species_Occurrence-method_svm-replication_subsampling, id_10-species_Occurrence-method_svm-replication_subsampling
nlayers(p2)
## [1] 10
plot(p2[[1:4]]) # plot the first 12 rasters

unnamed-chunk-5-2

# we can take the mean of raster over different runs for each method and species:
p2m <- predict(m2,newdata=preds,filename='p2m.img',mean=T)

p2m
## class       : RasterBrick 
## dimensions  : 71, 124, 8804, 5  (nrow, ncol, ncell, nlayers)
## resolution  : 4219.223, 4219.223  (x, y)
## extent      : 100975.3, 624159, 3988830, 4288395  (xmin, xmax, ymin, ymax)
## coord. ref. : NA 
## data source : /Users/bnaimi/Dropbox/R_Books_Docs/r-gis.net/_sdm_rforge/sdm/pkg/sdm/vignettes/p2m.img 
## names       : sp_1.m_rf.re_subs, sp_1.m_cart.re_subs, sp_1.m_fda.re_subs, sp_1.m_mars.re_subs, sp_1.m_svm.re_subs 
## min values  :      1.084167e-02,        0.000000e+00,       2.197873e-03,        3.262828e-05,      -1.135510e-01 
## max values  :         0.9937250,           0.9878049,          0.9968304,           0.9983699,          1.0777908 
## fullname    : species_Occurrence-method_rf-replication (Mean)_subsampling, species_Occurrence-method_cart-replication (Mean)_subsampling, species_Occurrence-method_fda-replication (Mean)_subsampling, species_Occurrence-method_mars-replication (Mean)_subsampling, species_Occurrence-method_svm-replication (Mean)_subsampling
plot(p2m)

unnamed-chunk-5-3

# full names of rasters:
getZ(p2m)
## [1] "species_Occurrence-method_rf-replication (Mean)_subsampling"  
## [2] "species_Occurrence-method_cart-replication (Mean)_subsampling"
## [3] "species_Occurrence-method_fda-replication (Mean)_subsampling" 
## [4] "species_Occurrence-method_mars-replication (Mean)_subsampling"
## [5] "species_Occurrence-method_svm-replication (Mean)_subsampling"

Ensemble forecasting

Studies have shown that predictions or projections by alternative models can be so variable that challenge the common practice of relying on one single method. A solution is to utilise several models (‘ensembles’) and use appropriate techniques to explore the resulting range of projections. Significant improvements on the robustness of a forecast can be achieved if an ensemble approach is used and the results analysed appropriately.

In the sdm package, the ensemble function can be used to generate an ensemble prediction or forecast based on the multiple models that are used in the sdm function. Several methods are implemented and can be used by a user in a flexible way. Here is an example:

# in the following, we predict the habitat suitability using the ensemble function
# since the newdata is a raster object, the output is also a raster object

# ensemble based on a Weighted averaging that is weighted using AUC statistic
e1 <- ensemble(m1,newdata=preds,filename='e1.img',setting=list(method='weighted',stat='AUC')) 

plot(e1)

ens1

# ensemble based on a Weighted averaging that is weighted using TSS statistic 
# with threshold criterion number 2 which is max(Sensitivity+Specificity) or max(TSS)
e2 <- ensemble(m2,newdata=preds,filename='e2.img',
                                   setting=list(method='weighted',stat='TSS',opt=2))

e2
## class       : RasterLayer 
## dimensions  : 71, 124, 8804  (nrow, ncol, ncell)
## resolution  : 4219.223, 4219.223  (x, y)
## extent      : 100975.3, 624159, 3988830, 4288395  (xmin, xmax, ymin, ymax)
## coord. ref. : NA 
## data source : /Users/bnaimi/Dropbox/R_Books_Docs/r-gis.net/_sdm_rforge/sdm/pkg/sdm/vignettes/e2.img 
## names       : e2 
## values      : 0.02094244, 0.9899206  (min, max)
plot(e2)

ens2

# ensemble based on an Unweighted averaging
e3 <- ensemble(m2,newdata=preds,filename='e3.img',setting=list(method='unweighted'))

plot(e3)

ens3

there are other options in the setting argument that can specified by a used, for example, one may define a numeric verctor as weight, or specify the id of some models that should be incorporated into the ensemble procedure.

Graphical User Interface (GUI)

The sdm package is followed by graphical user interfaces (GUIs) that make the package easy to use and user friendly. Using the gui function, a user can get access to the results of sdm in an interactive way, or it can be used to prepare data or fit SDMsthrough GUIs.

Following example shows how the results generated by the sdm function can be explored in GUIs:

# m2 was the output of sdm function in the above examples, Then one can use gui to explore everything inside m2:

gui(m2)

screenshot-2016-10-17-03-58-13

screenshot-2016-10-17-03-59-04  screenshot-2016-10-17-03-58-44 screenshot-2016-10-17-03-58-33screenshot-2016-10-17-03-59-43

Reference

Naimi, B., Araujo, M.B. (2016) sdm: a reproducible and extensible R platform for species distribution modelling, Ecography, 39:368-375, DOI: 10.1111/ecog.01881

Join the package mailing list on “sdm R” google group using this link