slide1 n.
Download
Skip this Video
Loading SlideShow in 5 Seconds..
用 R 语言进行生物多样性分析 PowerPoint Presentation
Download Presentation
用 R 语言进行生物多样性分析

Loading in 2 Seconds...

play fullscreen
1 / 41

用 R 语言进行生物多样性分析 - PowerPoint PPT Presentation


  • 94 Views
  • Uploaded on

用 R 语言进行生物多样性分析. 王绪高 中国科学院沈阳应用生态所. Diversity. Which one is more diverse?. Common data formats: (1) Presence/absence data, i.e., 0-1 data (2) Abundance data: number of individuals of each species

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about '用 R 语言进行生物多样性分析' - emery


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
slide1

用R语言进行生物多样性分析

王绪高

中国科学院沈阳应用生态所

diversity
Diversity

Which one is more diverse?

slide3

Common data formats:

(1) Presence/absence data, i.e., 0-1 data

(2) Abundance data: number of individuals of each species

(3) Cover/biomass data. Cover/biomass are measurements often used in plant ecology. Biomass is occasionally used in insect, marine ecology etc.

But (2) and (3) can be converted into presence/absence data

There are many field methods for collecting the above data. Some common ones include:

1. Quadrat sampling/transect line

2. Trapping (light, pitfall, suction): Mainly used to collect insects. The common form is abundance data. Trapping data cannot be easily linked to sampling area because we do not know the area base where the insects come from.

3. Sighting/hearing (for surveying birds, mammals): Collecting presence/absence data, not accurate for abundance (count).

4. Capture-remark methods: Birds, mammals, fishes.

slide4

spcode abund

1 ACACME 1

2 ADE1TR 23

3 AEGIPA 4

4 ALCHCO 37

5 ALLOPS 10

6 ALSEBL 231

7 AMAICO 1

8 ANACEX 4

9 ANDIIN 9

10 ANNOSP 4

11 APEIME 47

12 APEITI 4

13 ASPICR 10

14 AST1ST 42

15 AST2GR 13

16 BEILPE 77

17 BROSAL 48

18 CALOLO 14

19 CASEAC 3

20 CASEAR 15

……

A basic data form

slide5

Diversity indices

Species diversity consists of two fundamental components: abundance and richness.

Suppose x = (x1, x2, …, xS) is a sample of abundance of a community, where S is the number of species.

1. Abundance: the number of individuals of a species in a given area. The total number of abundance is N = x1 + x2 + … +xS = sum(xi)

2. Richness: the number of species in a given area which is S.

These two components are not independent of each other, they are related.

Most diversity indices are quantitative combinations of abundance and richness in such a way that richness is weighted by relative abundance of each species.

slide6

Diversity indices

3. The Shannon index, H

4. The Simpson index, D

The Shannon and Simpson indices are the two most widely used in the literature. The Shannon weighs towards rare species , while the Simpson weighs towards the abundant species.

slide7

Diversity indices

  • 5. The Margalef’s index
  • 6. The Menhinick’s index
  • 7. The McIntosh index, D
  • 8. The Berger-Parker index, d
  • 9. Brillouin index, HB
slide8

Relationship among the indices

Many indices are not independent but related. Hill demonstrates their relationship.

where Da is the a-th order of diversity, pi is the proportional abundance of the n-th species. It follows that D0 = number of species

D1 = exponential Shannon index

D2 = the Simpson index

D = the Berger-Parker index

Proof: the Shannon index at a -> 1 (use l’Hôpital rule):

Hill, M.O. 1973. Diversity and evenness: a unifying notation and its consequences. Ecology 54:427-431.

slide9

Evaluation and choice of diversity indices

  • Two criteria of “good” indices:
  • High discriminating power: The ability to detect subtle (not unduly) differences between samples. This is an important criterion because one of the major applications of diversity measures is to gauge the effects of environmental changes (pollution or other disturbances) on communities.
  • Independent of sample size: This criterion is most commonly used to judge whether an index is satisfactory or not. A good index must be relatively independent (no indices are truly independent of sample size) of sample size so that one can make sure that the index estimated from relatively small samples will represent the true community.
  • There is little concensus on which indices are “good” (let alone “best”). In general, indices can be divided into two types:
  • Type 1-- Indices weighted towards species richness (or rarity): Richness, the Margalef, the Menhinick’s, the Shannon index, the Brillouin, logseries a index, and the lognormal l.
  • Type 2 – Indices weighted towards species dominance/evenness (or abundance of species): the Simpson, the McIntosh D, and the Berger-Parker indices.
sample
Sample
  • Single sampling ~ square

sample.ran.squ=function(data, side, plotdim=c(1000,500))

{

xlo=runif(1,min=0,max=plotdim[1]-side)

ylo=runif(1,min=0,max=plotdim[2]-side)

xhi=xlo+side

yhi=ylo+side

randsample=subset(data, gx>=xlo&gx<xhi&gy>=ylo&gy<yhi)

no.ind=length(randsample$sp)

no.spp=length(unique(randsample$sp))

return(c(side^2/1e4,no.ind,no.spp))

}

sample1
Sample
  • Single sampling ~ rectangular

sample.ran.rect=function(data, side.x, side.y, plotdim=c(1000,500))

{

xlo=runif(1,min=0,max=plotdim[1]-side.x)

ylo=runif(1,min=0,max=plotdim[2]-side.y)

xhi=xlo+side.x

yhi=ylo+side.y

randsample=subset(data, gx>=xlo&gx<xhi&gy>=ylo&gy<yhi)

no.ind=length(randsample$sp)

no.spp=length(unique(randsample$sp))

return(c(side.x*side.y/1e4,no.ind,no.spp))

}

sample2
Sample

Single sampling ~ circle

sample.ran.circle=function(data, radius, plotdim=c(1000,500),graph=T)

{

xlo=runif(1,min=radius,max=plotdim[1]-radius)

ylo=runif(1,min=radius,max=plotdim[2]-radius)

num.ind=length(data$gx)

place=numeric()

dist=((data$gx-xlo)^2+(data$gy-ylo)^2)^0.5

dist.n=dist<=radius

place=(1:num.ind)[dist.n]

sample.new=data[place,]

sample.new=sample.new[is.na(sample.new$tag)==F,]

if(graph){

plot(data$gx,data$gy,xlab="x",ylab="y")

points(sample.new$gx,sample.new$gy,type="p", col="blue")

}

cat("Random sample point is",c(xlo,ylo),sep=" ","\n")

return(sample.new)

}

sample3
Sample

#################

spp.area=function(data,sides)

{

result=list()

nrow=length(sides)

for (i in 1:nrow)

result[[i]]=spp.area.onetime(data,sides[i])

fullresult=matrix(nrow=0,ncol=5)

for(i in 1:length(result))

{

fullresult=rbind(fullresult,result[[i]])

}

colnames(fullresult)=c("area","ind","spp","shannon","simpson")

fullresult=data.frame(fullresult)

par(mfrow=c(2,2))

plot(fullresult$area,fullresult$spp,col="red")

lines(fullresult$area,fullresult$spp,col="green")

plot(fullresult$area,fullresult$ind,col="blue")

lines(fullresult$area,fullresult$ind,col="green")

plot(fullresult$area,fullresult$shannon,col="black")

lines(fullresult$area,fullresult$shannon,col="black")

plot(fullresult$area,fullresult$simpson,col="black")

lines(fullresult$area,fullresult$simpson,col="black")

return(fullresult)

}

  • Nest sampling

spp.area.onetime=function(data,side)

{

randsample=subset(data,data$gx>=0&data$gx<side&data$gy>=0&data$gy<side)

no.ind=length(randsample$sp)

no.spp=length(unique(randsample$sp))

abund=numeric()

splist=unique(randsample$sp)

for (i in 1:no.spp)

{abund[i]=length(randsample$sp[randsample$sp==splist[i]])

}

p=abund/sum(abund)

shannon=-sum(p*log(p))

simpson=1-sum(p^2)

return(c(side^2/1e4,no.ind,no.spp,shannon,simpson))

}

############

sample4

5×5

10×10

20×20

50×50

100×100

250×250

Sample
  • A grid system
  • sample.side=function(data,side,plotdim)
  • {
  • x=y=seq(0,plotdim[1]-side,by=side)
  • n=length(seq(0,plotdim[1]-side,by=side))
  • x1=rep(x,each=n)
  • y1=rep(y,n)
  • loc=data.frame(x1,y1)
  • no.ind=numeric()
  • no.spp=numeric()
  • for (i in 1:n^2)
  • {randsample=subset(data,data$gx>=loc[i,]$x1&data$gx<(loc[i,]$x1+side)&data$gy>=loc[i,]$y1&data$gy<(loc[i,]$y1+side))
  • no.ind[i]=length(randsample$sp)
  • no.spp[i]=length(unique(randsample$sp))}
  • return(data.frame(x=x1,y=y1,ind=no.ind,sp=no.spp))
  • }
slide15

Variogram

Given a geostatistical model, Z(s), its variogram g(h) is formally defined as

where f(s, u) is the joint probability density function of Z(s) and Z(u).

For an intrinsic random field, the variogram can be estimated using the method of moments estimator, as follows:

where h is the distance separating sample locations si and si+h, N(h) is the number of distinct data pairs. In some circumstances, it may be desirable to consider direction in addition to distance. In isotropic case, h should be written as a scalar h, representing magnitude.

slide16

1.2

sill = 1.0

0.8

g(h)

0.4

nugget = 0.2

range h = 5

0.0

0

2

4

6

8

10

h

Variogram

The main goal of a variogram analysis is to construct a variogram that best estimates the autocorrelation structure of the underlying stochastic process. A typical variogram can be described using three parameters:

Nugget effect – represents micro-scale variation or measurement error. It is estimated from the empirical variogram at h= 0.

Range – is the distance at which the variogram

reaches the plateau, i.e., the distance (if any)

at which data are no longer correlated.

Sill – is the variance of the random field V(Z),

disregarding the spatial structure. It is the

plateau where the variogram reaches at the

range, g(range).

variogram
Variogram

result=sample.side(bci,side=20,plotdim=c(500,500))

result[1:4,]

library(geoR)

abund=result[,1:3]

abund.geo=as.geodata(abund)

dist=seq(0,500,20)

abund.var=variog(abund.geo,dir=0,uvec=dist)

##compute semi-variance in other directions

plot(abund.var)

spp=result[,c(1,2,4)]

spp.geo=as.geodata(spp)

spp.var=variog(spp.geo,dir=0,uvec=dist)

variogram1
Variogram

library(geoR)

Tri2pa=subset(bci,sp==“TRI2PA”&dbh>0)

Tri2=Tri2pa[,3:5]

Tri2.geo=as.geodata(Tri2)

dist=seq(0,200,10)

var=variog(Tri2.geo,dir=0,uvec=dist)

#variog(coords=Tri2pa[,3:4],data=Tri2pa[,5],dir=0,uvec=dist)

plot(var)

variog4(Tri2.geo,uvec=dist)

varpart1
Varpart

library(vegan) ; varpart

species area relationship

Sampling species

Species-area curve

Number of species

area

Species-area relationship
species area relationship1

Logarithmic model

Michaelis-Menten

Power model

Logistic model

Species-area relationship

The generalized species-area model

A special case of the logistic model (z = 1):

The derivative describes the rate of change in the number of species with one unit of change in area. An important point to make: models change with the change of scale.

He, F. and Legendre, P. 1996. On species-area relationships. American Naturalist 148(4): 719-737

species area relationship2
Species-area relationship

# fit the logarithmic model #

logarithm.div.fn=function(data)

{##nsp=c+z*ln(area)##

area=data$area

nsp=data$spp

lm.out=lm(nsp~log(area))

lm.res=residuals(lm.out)

res.square=sum(lm.res^2)

prd.s=predict(lm.out)

cat("\n Logarithm model:residuals^2 = ", res.square, "\n")

return(list(prd.s=prd.s,sum=summary(lm.out)))

}

#######data here is the result of nest sampling####

species area relationship3
Species-area relationship

# fit the power model #

power.div.fn=function(data)

{##nsp=b*area^(z)

area=data$area

nsp=data$spp

nls.out=nls(nsp~b*area^z,start=c(b=10,z=0.5))

nls.res=residuals(nls.out)

res.square=sum(nls.res^2)

prd.s=predict(nls.out)

cat("\n Power model: residuals^2 = ", res.square, "\n")

return(list(prd.s=prd.s,sum=summary(nls.out)))

}

#######data here is the result of nest sampling####

species area relationship4
Species-area relationship

# fit the logistic model #

logist.div.fn=function(data)

{##nsp=b/(a+area(-z))##

area=data$area

nsp=data$spp

nls.out=nls(nsp~b/(a+area^(-z)),start=c(b=10,a=0.1,z=0.5))

nls.res=residuals(nls.out)

res.square=sum(nls.res^2)

prd.s=predict(nls.out)

cat("\n Logistic model: residuals^2 = ", res.square, "\n")

return(list(prd.s=prd.s,sum=summary(nls.out)))

}

#######data here is the result of nest sampling####

species area relationship5
Species-area relationship

div.area.r=function(data)

{

area=data$area

nsp=data$spp

prd.log.s=logarithm.div.fn(data)

prd.power.s=power.div.fn(data)

prd.logist.s=logist.div.fn(data)

par(mfrow=c(1,2))

plot(area,nsp, xlab="area of sample", ylab="number of species",type="b")

lines(area,prd.log.s$prd.s,col="blue")

lines(area,prd.power.s$prd.s,col="green")

lines(area,prd.logist.s$prd.s,col="red")

plot(log(area),nsp, xlab="log(area of sample)", ylab="number of species",type="b")

lines(log(area),prd.log.s$prd.s,col="blue")

lines(log(area),prd.power.s$prd.s,col="green")

lines(log(area),prd.logist.s$prd.s,col="red")

}

#######data here is the result of nest sampling####

species abundance relationship distribution sad
Species-abundance relationship/distribution (SAD)
  • SAD is simply the abundance of each species in a community. In no community do species have equal abundance. Instead, communities are almost always found to have many rare species and a few common ones. Ecologists believe that SAD is an ecological and evolutionary product which reflects the interactions of species, the effect of habitat adaptation and population fitness. Therefore, each community should have its own characteristic SAD.
  • SAD is important because
    • It completely describes the structure (species composition) of a community.
    • It provides evidence for understanding community assembly rules: how species come into coexistence? How species share resources? Why there are always many more rare species than common ones? Why some species have to be rare while others to be so abundant?
    • Species-abundance data provide an only solid basis to examine biodiversity (because SAD contains all the information about abundance of each species).
slide28

[1], [2-3], [4-7], [8-16], …

[1]/2,[1]/2+ [2]/, [2]/+[3]+[4]/2, [4]/2+[5-7]+[8]/2, …

cdf.obs[i]=length(abund[abund<=x[i]])

abund1=sort(abund.data,decreasing=T)

slide29
SAD

#############################

count1=function(abund.dat)

{

##data is bci$sp

splist=unique(abund.dat);abund=numeric()

nsp=length(splist)

for (i in 1:nsp)

{

abund[i]=length(abund.dat[abund.dat==splist[i]])

}

return(data.frame(splist=splist,abund=abund))

}

########################################

  • SAD. Preston1.bin
  • # plot RSA using Preston's 2^2 scale: (0, 1], (1, 3], (3, 7], (7,15],...
  • #Volkov's method for splitting boundary binnings to right and left bins
  • ##[1]/2, [1]/2+[2]/2,[2]/2+[3]+[4]/4,[4]/2+[5-7]+[8]/2,…..
  • #############
  • sp.abund=function(data)
  • {
  • ##data is bci$sp
  • sp=count1(data)$splist
  • abund=count1(data)$abund
  • spnum=length(sp)
  • Preston1.nsp=numeric()
  • Preston2.nsp=numeric()
  • number=ceiling(log2(max(abund)))
  • for (i in 1:number){
  • for (j in 2:i+1)
  • {
  • Preston1.nsp[i]=sum(length(sp[abund>=2^(i-1)&abund<=(2^i-1)]))
  • Preston2.nsp[1]=0.5*length(sp[abund==1])
  • Preston2.nsp[j]=sum(length(sp[abund>=2^(j-2)&abund<=(2^(j-1))]))-0.5*length(sp[abund==2^(j-2)])-0.5*length(sp[abund==2^(j-1)])
  • }
  • }
  • par(mfrow=c(2,2))
  • hist(abund,xlab="Abundance",ylab="Species frequency",main="RSA")
  • plot(c(1:number), Preston1.nsp, xlab="Abundance class", ylab="Species frequency", type="h", main="RSA: Preston Binning 1", lwd=4)
  • lines(c(1:number),Preston1.nsp,col="red")
  • plot(c(1:(number+1)), Preston2.nsp, xlab="Abundance class", ylab="Species frequency", type="h", main="RSA: Preston Binning 2", lwd=4)
  • lines(c(1:(number+1)),Preston2.nsp,col="green")
  • plot(1:length(sp), log(sort(abund, decreasing=T)), type="o", xlab="Species sequence", ylab="log(abundance)", main="Dominance Curve")
  • cat("\n observed number of species = ", spnum, "\n")
  • return(list(preston1.bin=Preston1.nsp,preston2.bin=Preston2.nsp))
  • }
sad logseries distribution

n = 1, 2, 3, …

….

SAD~logseries distribution
  • This model is first used by Fisher, Corbet & Williams (1943) to model species-abundance pattern of Malayan butterflies and Lepidoptera (butterflies, moths) caught by a light-trap at Rothamsted Experimental Station. It represents the first attempt to describe the mathematical relationship between the number of species and the number of individuals in those species. It is one of the two best known SAD models.
  • The frequency (the number of species) with n individuals in a community is
  • where a and x are two parameters, a > 0 and 0 < x < 1 (in most cases x > 0.9), but they are not independent (we will show that in a moment).
  • The terms of the logseries distribution are the number of species with one individual, two individuals, three individuals…:
slide31

n = 1, 2, 3, …

  • Estimating methods
  • MLE
  • Here pn is the probability, no longer frequency.
slide32

That is why it is called logarithmic series, while in fact is a power series dist.

S = 1 if the logseries model is described as probability rather than frequency.

Based on these terms, we can calculate the total number of species and total number of individuals.

1. Total number of species, S, is obtained by adding all the terms in the series:

This summation reduces to the following equation (how?):

2. The total number of individuals:

slide33

n = 1, 2, 3, …

  • Estimating methods
  • Moment method (what that means?)
  • For bci abundance data :
  • S = 321, N = 368122.
  • Results: a = 34.62111, x = 0.99991
parameter estimation
Parameter estimation

SAD~logseries

logseries.moment=function(alpha, abund){

###sp.abund=count1(bci$sp); abund=sp.abund$abund

# alpha is initial value, which is needed for the Newton method

# estimate the parameters for the logseries distribution

S=length(abund) # total number of species

N=sum(abund) # total number of individuals

optim.out=optim(alpha,optim.moment.fn, S=S,N=N)

print(optim.out)

alpha=optim.out$par

x=N/(N+alpha)

cat("\n alpha = ", round(alpha,5), "x = ", round(x,5), "\n")

}

#########################

# optimization function #

#########################

optim.moment.fn=function(alpha,S,N) {

# The function is: S=alpha*log(1+N/alpha)

abs(S-alpha*log(1+N/alpha))

}

parameter estimation1
Parameter estimation

SAD~logseries

Maximum likelihood estimates

logseries.mle.r=function(alpha, abund)

{

S=length(abund) # total number of species

N=sum(abund) # total number of individuals

optim.mle.out=optim(alpha,optim.mle.fn,,abund=abund)

alpha=optim.mle.out$par

x=1-exp(-1/alpha)

cat("\n alpha = ", round(alpha,5), "x = ", round(x,5), "\n")

}

#########################

# optimization function #

#########################

optim.mle.fn=function(alpha,abund) {

S=length(abund)

N=sum(abund)

obj.fn=S*log(alpha)+N*(log(1-exp(-1/alpha)))-sum(log(abund))

-obj.fn

}

sad lognormal distribution
SAD~ Lognormal distribution

Lognormal distribution

It is one of the two most widely used species-abundance model, first used by Preston in 1948. Different from the logseries model, it is a model for continuous random variable, i.e., abundance is considered as a continuous random variable (the model is therefore applicable to biomass and cover data). This assumption is approximately valid for species rich communities.

It has been found that many real communities follow the lognormal distribution. This is particularly true for species rich communities such as tropical rainforests.

If y = ln(x) is normally distributed y ~ N(s2, m), then x = exp(y) is lognormal. It has the form:

slide37

Estimation methods (MLE and moment estimate):

where m and s2 are the mean and variance of the normally distributed y which are the mean and variance of the log-transformed abundance data.

The mean and variance of the lognormal distribution are:

slide38

BCI data:

  • Steps to estimate m and s2:
  • Log-transform the abundance data,
  • Calculate the mean and variance of the log-transformed data: m = 4.8194, s2 = 5.7998,
  • Substitute the mean and variance into the lognormal model:
parameter estimation2
Parameter estimation

SAD~Lognormal distribution

mle.lognormal.r=function(data, mu, sigma)

{###sp.abund=count1(bci$sp); data=sp.abund$abund

# MLE estimation of lognormal distribution

# There are 2 parameters: mean (mu) and std (sigma)

startparam <- c(mu, sigma)

out <- optim(c(startparam[1], startparam[2]), fn = lognormal.llik.obj, x=data)

mu <- out$par[1]

sigma <- out$par[2]

AIC=out$value*2+2*length(startparam)

cat(" AIC= ",round(AIC,5), "\n")

cat(" mu = ", round(mu, 5), "\n")

cat(" sigma = ", round(sigma, 5), "\n")

}

####################################################

# The objective of the normal likelihood function #

####################################################

lognormal.llik.obj=function(param, x)

{

mu = param[1]

sigma = param[2]

s=length(x)

obj.fn=-0.5*s*log(2*pi)-s*log(sigma)-sum(log(x))-0.5*(sigma^(-2))*sum((log(x)-mu)^2)

-obj.fn

}

model validation
Model validation

AIC (Akaike Information Criterion)

  • AIC=-2L + 2k
    • L is the log-likelihood and k is the number of parameters in the models.
  • AICc=AIC+2k(k+1)/(n-k-1)
    • For small sample sizes(n): n/k<40
  • ∆AIC<2: equivalent;
  • ∆AIC 4-7: distinguishable;
  • ∆AIC>10: definitely different