用 R 语言进行生物多样性分析

用R语言进行生物多样性分析 王绪高中国科学院沈阳应用生态所

Diversity Which one is more diverse?

Common data formats: (1) Presence/absence data, i.e., 0-1 data (2) Abundance data: number of individuals of each species (3) Cover/biomass data. Cover/biomass are measurements often used in plant ecology. Biomass is occasionally used in insect, marine ecology etc. But (2) and (3) can be converted into presence/absence data There are many field methods for collecting the above data. Some common ones include: 1. Quadrat sampling/transect line 2. Trapping (light, pitfall, suction): Mainly used to collect insects. The common form is abundance data. Trapping data cannot be easily linked to sampling area because we do not know the area base where the insects come from. 3. Sighting/hearing (for surveying birds, mammals): Collecting presence/absence data, not accurate for abundance (count). 4. Capture-remark methods: Birds, mammals, fishes.

spcode abund 1 ACACME 1 2 ADE1TR 23 3 AEGIPA 4 4 ALCHCO 37 5 ALLOPS 10 6 ALSEBL 231 7 AMAICO 1 8 ANACEX 4 9 ANDIIN 9 10 ANNOSP 4 11 APEIME 47 12 APEITI 4 13 ASPICR 10 14 AST1ST 42 15 AST2GR 13 16 BEILPE 77 17 BROSAL 48 18 CALOLO 14 19 CASEAC 3 20 CASEAR 15 …… A basic data form

Diversity indices Species diversity consists of two fundamental components: abundance and richness. Suppose x = (x1, x2, …, xS) is a sample of abundance of a community, where S is the number of species. 1. Abundance: the number of individuals of a species in a given area. The total number of abundance is N = x1 + x2 + … +xS = sum(xi) 2. Richness: the number of species in a given area which is S. These two components are not independent of each other, they are related. Most diversity indices are quantitative combinations of abundance and richness in such a way that richness is weighted by relative abundance of each species.

Diversity indices 3. The Shannon index, H 4. The Simpson index, D The Shannon and Simpson indices are the two most widely used in the literature. The Shannon weighs towards rare species , while the Simpson weighs towards the abundant species.

Diversity indices • 5. The Margalef’s index • 6. The Menhinick’s index • 7. The McIntosh index, D • 8. The Berger-Parker index, d • 9. Brillouin index, HB

Relationship among the indices Many indices are not independent but related. Hill demonstrates their relationship. where Da is the a-th order of diversity, pi is the proportional abundance of the n-th species. It follows that D0 = number of species D1 = exponential Shannon index D2 = the Simpson index D = the Berger-Parker index Proof: the Shannon index at a -> 1 (use l’Hôpital rule): Hill, M.O. 1973. Diversity and evenness: a unifying notation and its consequences. Ecology 54:427-431.

Evaluation and choice of diversity indices • Two criteria of “good” indices: • High discriminating power: The ability to detect subtle (not unduly) differences between samples. This is an important criterion because one of the major applications of diversity measures is to gauge the effects of environmental changes (pollution or other disturbances) on communities. • Independent of sample size: This criterion is most commonly used to judge whether an index is satisfactory or not. A good index must be relatively independent (no indices are truly independent of sample size) of sample size so that one can make sure that the index estimated from relatively small samples will represent the true community. • There is little concensus on which indices are “good” (let alone “best”). In general, indices can be divided into two types: • Type 1-- Indices weighted towards species richness (or rarity): Richness, the Margalef, the Menhinick’s, the Shannon index, the Brillouin, logseries a index, and the lognormal l. • Type 2 – Indices weighted towards species dominance/evenness (or abundance of species): the Simpson, the McIntosh D, and the Berger-Parker indices.

Sample • Single sampling ~ square sample.ran.squ=function(data, side, plotdim=c(1000,500)) { xlo=runif(1,min=0,max=plotdim[1]-side) ylo=runif(1,min=0,max=plotdim[2]-side) xhi=xlo+side yhi=ylo+side randsample=subset(data, gx>=xlo&gx<xhi&gy>=ylo&gy<yhi) no.ind=length(randsample$sp) no.spp=length(unique(randsample$sp)) return(c(side^2/1e4,no.ind,no.spp)) }

Sample • Single sampling ~ rectangular sample.ran.rect=function(data, side.x, side.y, plotdim=c(1000,500)) { xlo=runif(1,min=0,max=plotdim[1]-side.x) ylo=runif(1,min=0,max=plotdim[2]-side.y) xhi=xlo+side.x yhi=ylo+side.y randsample=subset(data, gx>=xlo&gx<xhi&gy>=ylo&gy<yhi) no.ind=length(randsample$sp) no.spp=length(unique(randsample$sp)) return(c(side.x*side.y/1e4,no.ind,no.spp)) }

Sample Single sampling ~ circle sample.ran.circle=function(data, radius, plotdim=c(1000,500),graph=T) { xlo=runif(1,min=radius,max=plotdim[1]-radius) ylo=runif(1,min=radius,max=plotdim[2]-radius) num.ind=length(data$gx) place=numeric() dist=((data$gx-xlo)^2+(data$gy-ylo)^2)^0.5 dist.n=dist<=radius place=(1:num.ind)[dist.n] sample.new=data[place,] sample.new=sample.new[is.na(sample.new$tag)==F,] if(graph){ plot(data$gx,data$gy,xlab="x",ylab="y") points(sample.new$gx,sample.new$gy,type="p", col="blue") } cat("Random sample point is",c(xlo,ylo),sep=" ","\n") return(sample.new) }

Sample ################# spp.area=function(data,sides) { result=list() nrow=length(sides) for (i in 1:nrow) result[[i]]=spp.area.onetime(data,sides[i]) fullresult=matrix(nrow=0,ncol=5) for(i in 1:length(result)) { fullresult=rbind(fullresult,result[[i]]) } colnames(fullresult)=c("area","ind","spp","shannon","simpson") fullresult=data.frame(fullresult) par(mfrow=c(2,2)) plot(fullresult$area,fullresult$spp,col="red") lines(fullresult$area,fullresult$spp,col="green") plot(fullresult$area,fullresult$ind,col="blue") lines(fullresult$area,fullresult$ind,col="green") plot(fullresult$area,fullresult$shannon,col="black") lines(fullresult$area,fullresult$shannon,col="black") plot(fullresult$area,fullresult$simpson,col="black") lines(fullresult$area,fullresult$simpson,col="black") return(fullresult) } • Nest sampling spp.area.onetime=function(data,side) { randsample=subset(data,data$gx>=0&data$gx<side&data$gy>=0&data$gy<side) no.ind=length(randsample$sp) no.spp=length(unique(randsample$sp)) abund=numeric() splist=unique(randsample$sp) for (i in 1:no.spp) {abund[i]=length(randsample$sp[randsample$sp==splist[i]]) } p=abund/sum(abund) shannon=-sum(p*log(p)) simpson=1-sum(p^2) return(c(side^2/1e4,no.ind,no.spp,shannon,simpson)) } ############

5×5 10×10 20×20 50×50 100×100 250×250 Sample • A grid system • sample.side=function(data,side,plotdim) • { • x=y=seq(0,plotdim[1]-side,by=side) • n=length(seq(0,plotdim[1]-side,by=side)) • x1=rep(x,each=n) • y1=rep(y,n) • loc=data.frame(x1,y1) • no.ind=numeric() • no.spp=numeric() • for (i in 1:n^2) • {randsample=subset(data,data$gx>=loc[i,]$x1&data$gx<(loc[i,]$x1+side)&data$gy>=loc[i,]$y1&data$gy<(loc[i,]$y1+side)) • no.ind[i]=length(randsample$sp) • no.spp[i]=length(unique(randsample$sp))} • return(data.frame(x=x1,y=y1,ind=no.ind,sp=no.spp)) • }

Variogram Given a geostatistical model, Z(s), its variogram g(h) is formally defined as where f(s, u) is the joint probability density function of Z(s) and Z(u). For an intrinsic random field, the variogram can be estimated using the method of moments estimator, as follows: where h is the distance separating sample locations si and si+h, N(h) is the number of distinct data pairs. In some circumstances, it may be desirable to consider direction in addition to distance. In isotropic case, h should be written as a scalar h, representing magnitude.

1.2 sill = 1.0 0.8 g(h) 0.4 nugget = 0.2 range h = 5 0.0 0 2 4 6 8 10 h Variogram The main goal of a variogram analysis is to construct a variogram that best estimates the autocorrelation structure of the underlying stochastic process. A typical variogram can be described using three parameters: Nugget effect – represents micro-scale variation or measurement error. It is estimated from the empirical variogram at h= 0. Range – is the distance at which the variogram reaches the plateau, i.e., the distance (if any) at which data are no longer correlated. Sill – is the variance of the random field V(Z), disregarding the spatial structure. It is the plateau where the variogram reaches at the range, g(range).

Variogram result=sample.side(bci,side=20,plotdim=c(500,500)) result[1:4,] library(geoR) abund=result[,1:3] abund.geo=as.geodata(abund) dist=seq(0,500,20) abund.var=variog(abund.geo,dir=0,uvec=dist) ##compute semi-variance in other directions plot(abund.var) spp=result[,c(1,2,4)] spp.geo=as.geodata(spp) spp.var=variog(spp.geo,dir=0,uvec=dist)

Variogram library(geoR) Tri2pa=subset(bci,sp==“TRI2PA”&dbh>0) Tri2=Tri2pa[,3:5] Tri2.geo=as.geodata(Tri2) dist=seq(0,200,10) var=variog(Tri2.geo,dir=0,uvec=dist) #variog(coords=Tri2pa[,3:4],data=Tri2pa[,5],dir=0,uvec=dist) plot(var) variog4(Tri2.geo,uvec=dist)

Varpart

Varpart library(vegan) ; varpart

Sampling species • • • Species-area curve • • • • • • • Number of species • • • • • • • • • area Species-area relationship

Logarithmic model Michaelis-Menten Power model Logistic model Species-area relationship The generalized species-area model A special case of the logistic model (z = 1): The derivative describes the rate of change in the number of species with one unit of change in area. An important point to make: models change with the change of scale. He, F. and Legendre, P. 1996. On species-area relationships. American Naturalist 148(4): 719-737

Species-area relationship # fit the logarithmic model # logarithm.div.fn=function(data) {##nsp=c+z*ln(area)## area=data$area nsp=data$spp lm.out=lm(nsp~log(area)) lm.res=residuals(lm.out) res.square=sum(lm.res^2) prd.s=predict(lm.out) cat("\n Logarithm model:residuals^2 = ", res.square, "\n") return(list(prd.s=prd.s,sum=summary(lm.out))) } #######data here is the result of nest sampling####

Species-area relationship # fit the power model # power.div.fn=function(data) {##nsp=b*area^(z) area=data$area nsp=data$spp nls.out=nls(nsp~b*area^z,start=c(b=10,z=0.5)) nls.res=residuals(nls.out) res.square=sum(nls.res^2) prd.s=predict(nls.out) cat("\n Power model: residuals^2 = ", res.square, "\n") return(list(prd.s=prd.s,sum=summary(nls.out))) } #######data here is the result of nest sampling####

Species-area relationship # fit the logistic model # logist.div.fn=function(data) {##nsp=b/(a+area(-z))## area=data$area nsp=data$spp nls.out=nls(nsp~b/(a+area^(-z)),start=c(b=10,a=0.1,z=0.5)) nls.res=residuals(nls.out) res.square=sum(nls.res^2) prd.s=predict(nls.out) cat("\n Logistic model: residuals^2 = ", res.square, "\n") return(list(prd.s=prd.s,sum=summary(nls.out))) } #######data here is the result of nest sampling####

Species-area relationship div.area.r=function(data) { area=data$area nsp=data$spp prd.log.s=logarithm.div.fn(data) prd.power.s=power.div.fn(data) prd.logist.s=logist.div.fn(data) par(mfrow=c(1,2)) plot(area,nsp, xlab="area of sample", ylab="number of species",type="b") lines(area,prd.log.s$prd.s,col="blue") lines(area,prd.power.s$prd.s,col="green") lines(area,prd.logist.s$prd.s,col="red") plot(log(area),nsp, xlab="log(area of sample)", ylab="number of species",type="b") lines(log(area),prd.log.s$prd.s,col="blue") lines(log(area),prd.power.s$prd.s,col="green") lines(log(area),prd.logist.s$prd.s,col="red") } #######data here is the result of nest sampling####

Species-abundance relationship/distribution (SAD) • SAD is simply the abundance of each species in a community. In no community do species have equal abundance. Instead, communities are almost always found to have many rare species and a few common ones. Ecologists believe that SAD is an ecological and evolutionary product which reflects the interactions of species, the effect of habitat adaptation and population fitness. Therefore, each community should have its own characteristic SAD. • SAD is important because • It completely describes the structure (species composition) of a community. • It provides evidence for understanding community assembly rules: how species come into coexistence? How species share resources? Why there are always many more rare species than common ones? Why some species have to be rare while others to be so abundant? • Species-abundance data provide an only solid basis to examine biodiversity (because SAD contains all the information about abundance of each species).

[1], [2-3], [4-7], [8-16], … [1]/2,[1]/2+ [2]/, [2]/+[3]+[4]/2, [4]/2+[5-7]+[8]/2, … cdf.obs[i]=length(abund[abund<=x[i]]) abund1=sort(abund.data,decreasing=T)

SAD ############################# count1=function(abund.dat) { ##data is bci$sp splist=unique(abund.dat);abund=numeric() nsp=length(splist) for (i in 1:nsp) { abund[i]=length(abund.dat[abund.dat==splist[i]]) } return(data.frame(splist=splist,abund=abund)) } ######################################## • SAD. Preston1.bin • # plot RSA using Preston's 2^2 scale: (0, 1], (1, 3], (3, 7], (7,15],... • #Volkov's method for splitting boundary binnings to right and left bins • ##[1]/2, [1]/2+[2]/2,[2]/2+[3]+[4]/4,[4]/2+[5-7]+[8]/2,….. • ############# • sp.abund=function(data) • { • ##data is bci$sp • sp=count1(data)$splist • abund=count1(data)$abund • spnum=length(sp) • Preston1.nsp=numeric() • Preston2.nsp=numeric() • number=ceiling(log2(max(abund))) • for (i in 1:number){ • for (j in 2:i+1) • { • Preston1.nsp[i]=sum(length(sp[abund>=2^(i-1)&abund<=(2^i-1)])) • Preston2.nsp[1]=0.5*length(sp[abund==1]) • Preston2.nsp[j]=sum(length(sp[abund>=2^(j-2)&abund<=(2^(j-1))]))-0.5*length(sp[abund==2^(j-2)])-0.5*length(sp[abund==2^(j-1)]) • } • } • par(mfrow=c(2,2)) • hist(abund,xlab="Abundance",ylab="Species frequency",main="RSA") • plot(c(1:number), Preston1.nsp, xlab="Abundance class", ylab="Species frequency", type="h", main="RSA: Preston Binning 1", lwd=4) • lines(c(1:number),Preston1.nsp,col="red") • plot(c(1:(number+1)), Preston2.nsp, xlab="Abundance class", ylab="Species frequency", type="h", main="RSA: Preston Binning 2", lwd=4) • lines(c(1:(number+1)),Preston2.nsp,col="green") • plot(1:length(sp), log(sort(abund, decreasing=T)), type="o", xlab="Species sequence", ylab="log(abundance)", main="Dominance Curve") • cat("\n observed number of species = ", spnum, "\n") • return(list(preston1.bin=Preston1.nsp,preston2.bin=Preston2.nsp)) • }

n = 1, 2, 3, … …. SAD~logseries distribution • This model is first used by Fisher, Corbet & Williams (1943) to model species-abundance pattern of Malayan butterflies and Lepidoptera (butterflies, moths) caught by a light-trap at Rothamsted Experimental Station. It represents the first attempt to describe the mathematical relationship between the number of species and the number of individuals in those species. It is one of the two best known SAD models. • The frequency (the number of species) with n individuals in a community is • where a and x are two parameters, a > 0 and 0 < x < 1 (in most cases x > 0.9), but they are not independent (we will show that in a moment). • The terms of the logseries distribution are the number of species with one individual, two individuals, three individuals…:

n = 1, 2, 3, … • Estimating methods • MLE • Here pn is the probability, no longer frequency.

That is why it is called logarithmic series, while in fact is a power series dist. S = 1 if the logseries model is described as probability rather than frequency. Based on these terms, we can calculate the total number of species and total number of individuals. 1. Total number of species, S, is obtained by adding all the terms in the series: This summation reduces to the following equation (how?): 2. The total number of individuals:

n = 1, 2, 3, … • Estimating methods • Moment method (what that means?) • For bci abundance data : • S = 321, N = 368122. • Results: a = 34.62111, x = 0.99991

Parameter estimation SAD~logseries logseries.moment=function(alpha, abund){ ###sp.abund=count1(bci$sp); abund=sp.abund$abund # alpha is initial value, which is needed for the Newton method # estimate the parameters for the logseries distribution S=length(abund) # total number of species N=sum(abund) # total number of individuals optim.out=optim(alpha,optim.moment.fn, S=S,N=N) print(optim.out) alpha=optim.out$par x=N/(N+alpha) cat("\n alpha = ", round(alpha,5), "x = ", round(x,5), "\n") } ######################### # optimization function # ######################### optim.moment.fn=function(alpha,S,N) { # The function is: S=alpha*log(1+N/alpha) abs(S-alpha*log(1+N/alpha)) }

Parameter estimation SAD~logseries Maximum likelihood estimates logseries.mle.r=function(alpha, abund) { S=length(abund) # total number of species N=sum(abund) # total number of individuals optim.mle.out=optim(alpha,optim.mle.fn,,abund=abund) alpha=optim.mle.out$par x=1-exp(-1/alpha) cat("\n alpha = ", round(alpha,5), "x = ", round(x,5), "\n") } ######################### # optimization function # ######################### optim.mle.fn=function(alpha,abund) { S=length(abund) N=sum(abund) obj.fn=S*log(alpha)+N*(log(1-exp(-1/alpha)))-sum(log(abund)) -obj.fn }

SAD~ Lognormal distribution Lognormal distribution It is one of the two most widely used species-abundance model, first used by Preston in 1948. Different from the logseries model, it is a model for continuous random variable, i.e., abundance is considered as a continuous random variable (the model is therefore applicable to biomass and cover data). This assumption is approximately valid for species rich communities. It has been found that many real communities follow the lognormal distribution. This is particularly true for species rich communities such as tropical rainforests. If y = ln(x) is normally distributed y ~ N(s2, m), then x = exp(y) is lognormal. It has the form:

Estimation methods (MLE and moment estimate): where m and s2 are the mean and variance of the normally distributed y which are the mean and variance of the log-transformed abundance data. The mean and variance of the lognormal distribution are:

BCI data: • Steps to estimate m and s2: • Log-transform the abundance data, • Calculate the mean and variance of the log-transformed data: m = 4.8194, s2 = 5.7998, • Substitute the mean and variance into the lognormal model:

Parameter estimation SAD~Lognormal distribution mle.lognormal.r=function(data, mu, sigma) {###sp.abund=count1(bci$sp); data=sp.abund$abund # MLE estimation of lognormal distribution # There are 2 parameters: mean (mu) and std (sigma) startparam <- c(mu, sigma) out <- optim(c(startparam[1], startparam[2]), fn = lognormal.llik.obj, x=data) mu <- out$par[1] sigma <- out$par[2] AIC=out$value*2+2*length(startparam) cat(" AIC= ",round(AIC,5), "\n") cat(" mu = ", round(mu, 5), "\n") cat(" sigma = ", round(sigma, 5), "\n") } #################################################### # The objective of the normal likelihood function # #################################################### lognormal.llik.obj=function(param, x) { mu = param[1] sigma = param[2] s=length(x) obj.fn=-0.5*s*log(2*pi)-s*log(sigma)-sum(log(x))-0.5*(sigma^(-2))*sum((log(x)-mu)^2) -obj.fn }

Model validation AIC (Akaike Information Criterion) • AIC=-2L + 2k • L is the log-likelihood and k is the number of parameters in the models. • AICc=AIC+2k(k+1)/(n-k-1) • For small sample sizes(n): n/k<40 • ∆AIC<2: equivalent; • ∆AIC 4-7: distinguishable; • ∆AIC>10: definitely different

Thank you!

用 R 语言进行生物多样性分析

用 R 语言进行生物多样性分析

Presentation Transcript

DIVERSITY

Diversity

Diversity

Diversity

DIVERSITY :

Diversity

DIVERSITY

Diversity

“Diversity”

Diversity

DIVERSITY

Diversity

Diversity

DIVERSITY

Diversity

Diversity

DIVERSITY

Diversity

Diversity

Diversity

Diversity!

DIVERSITY