
A brief introduction to R, packages and graphics used in biodiversity research

9th Biodiversity Conference (2010-11-04, Xiamen, Fujian). A brief introduction to R, packages and graphics used in biodiversity research. Kechang NIU (牛克昌), Department of Ecology, Peking University.


Presentation Transcript


  1. 9th Biodiversity Conference (2010-11-04, Xiamen, Fujian). A brief introduction to R, packages and graphics used in biodiversity research. Kechang NIU (牛克昌), Department of Ecology, Peking University

  2. Outline: What this is
     Section 1: A short, highly incomplete tour of some of the basic concepts of R as a programming language
     Section 2: Some hints on how to obtain documentation on the many library functions (packages)
     Section 3: A brief introduction to R graphics

  3. Section 1: R, S and S-plus
     S: an interactive environment for data analysis, developed at Bell Laboratories since 1976
     • 1988 - S2: RA Becker, JM Chambers, A Wilks
     • 1992 - S3: JM Chambers, TJ Hastie
     • 1998 - S4: JM Chambers
     Exclusively licensed by AT&T/Lucent to Insightful Corporation, Seattle WA. Product name: "S-plus". Implementation languages: C, Fortran.
     See: http://cm.bell-labs.com/cm/ms/departments/sia/S/history.html

  4. Section 1: R, S and S-plus
     • R: initially written by Ross Ihaka and Robert Gentleman at the Department of Statistics, University of Auckland, New Zealand, during the 1990s.
     • Since 1997: international "R-core" team of ca. 15 people with access to a common CVS archive.
     • GNU General Public License (GPL)
       - can be used by anyone for any purpose
       - contagious: derived works must be distributed under the same license
     • Open Source
       - quality control!
       - efficient bug tracking and fixing system supported by the user community

  5. Section 1: What R is, and what R is not
     What R is:
     • data handling and storage: numeric, textual
     • matrix algebra
     • hash tables and regular expressions
     • high-level data analytic and statistical functions
     • classes ("OO")
     • graphics
     • programming language: loops, branching, subroutines
     What R is not:
     • is not a database, but connects to DBMSs
     • has no graphical user interfaces, but connects to Java, TclTk
     • language interpreter can be very slow, but allows calling your own C/C++ code
     • no spreadsheet view of data, but connects to Excel/MsOffice
     • no professional / commercial support

  6. Section 1: R packaging and statistics
     • Packaging: a crucial infrastructure to efficiently produce, load and keep consistent software libraries from (many) different sources / authors
     • Statistics: most packages deal with statistics and data analysis
     • State of the art: many statistical researchers provide their methods as R packages

  7. Section 1: R as a calculator
     > log2(32)
     [1] 5
     > sqrt(2)
     [1] 1.414214
     > seq(0, 5, length=6)
     [1] 0 1 2 3 4 5
     > plot(sin(seq(0, 2*pi, length=100)))

  8. Section 1: Variables
     Numeric:
     > a = 49
     > sqrt(a)
     [1] 7
     Character string:
     > a = "The dog ate my homework"
     > sub("dog","cat",a)
     [1] "The cat ate my homework"
     Logical:
     > a = (1+1==3)
     > a
     [1] FALSE

  9. Section 1: Missing values
     Variables of each data type (numeric, character, logical) can also take the value NA: not available.
     • NA is not the same as 0
     • NA is not the same as ""
     • NA is not the same as FALSE
     Operations (calculations, comparisons) that involve NA usually produce NA, but not always: the result is NA only when it depends on the unknown value.
     > NA==1
     [1] NA
     > 1+NA
     [1] NA
     > max(c(NA, 4, 7))
     [1] NA
     > NA | TRUE
     [1] TRUE
     > NA & TRUE
     [1] NA
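
Because comparisons with NA return NA, missing values are normally detected with is.na(), and many summary functions accept na.rm = TRUE to drop them first. A minimal sketch, reusing the vector from the slide (the variable names are illustrative only):

```r
# Vector from the slide, with one missing value
x <- c(NA, 4, 7)

# is.na() is the reliable test; x == NA would itself return NA
missing <- is.na(x)

# Many summary functions can drop NAs before computing
largest <- max(x, na.rm = TRUE)
average <- mean(x, na.rm = TRUE)

print(missing)
print(largest)
print(average)
```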

  10. Section 1: Functions and operators
      Functions do things with data.
      "Input": function arguments (0, 1, 2, …)
      "Output": function result (exactly one)
      Example:
      add = function(a,b) {
        result = a+b
        return(result)
      }
      Operators: shorthand notation for frequently used functions of one or two arguments.
      Examples: + - * / ! & | %%
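
To make the link between functions and operators concrete, the sketch below calls the add function from the slide and then calls the + operator as an ordinary function using backticks. The variable names are chosen here for illustration.

```r
# The function from the slide: two arguments in, exactly one result out
add <- function(a, b) {
  result <- a + b
  return(result)
}

total <- add(2, 3)

# Operators are functions too: quoting the name in backticks
# lets you call `+` with the usual function-call syntax
same_total <- `+`(2, 3)

# %% is the remainder (modulo) operator
remainder <- 7 %% 3

print(c(total, same_total, remainder))
```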

  11. Section 1: Vectors and matrices
      Vector: an ordered collection of data of the same type
      > a = c(1,2,3)
      > a*2
      [1] 2 4 6
      Matrix: a rectangular table of data of the same type
      Example: the expression values for 10000 genes for 30 tissue biopsies: a matrix with 10000 rows and 30 columns.
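
As a small stand-in for the gene-expression example, the sketch below builds a 4 x 3 matrix of random numbers (the values, dimensions and row/column names are made up for illustration) and shows indexing and a row-wise summary.

```r
set.seed(1)  # make the random values reproducible

# Toy "expression matrix": 4 genes (rows) x 3 biopsies (columns)
expr <- matrix(
  rnorm(12),
  nrow = 4, ncol = 3,
  dimnames = list(paste0("gene", 1:4), paste0("biopsy", 1:3))
)

# Dimensions: number of rows, number of columns
print(dim(expr))

# Index by row and column name (positions work as well: expr[2, 3])
one_value <- expr["gene2", "biopsy3"]

# Mean expression of each gene across all biopsies
gene_means <- rowMeans(expr)
print(gene_means)
```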

  12. Section 1: Data frames
      A data frame represents the typical data table that researchers come up with, like a spreadsheet. It is a rectangular table with rows and columns; data within each column have the same type (e.g. number, text, logical), but different columns may have different types. Each row holds one case.
      Example:
      > a
            localisation tumorsize progress
      XX348     proximal       6.3    FALSE
      XX234       distal       8.0     TRUE
      XX987     proximal      10.0    FALSE
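
The slide shows the data frame only as printed output. One way such a table could be built and queried is sketched below; the values are copied from the slide, while the construction itself is an assumption about how the object was created.

```r
# Rebuild the example table: one column per variable, one row per case
a <- data.frame(
  localisation = c("proximal", "distal", "proximal"),
  tumorsize    = c(6.3, 8.0, 10.0),
  progress     = c(FALSE, TRUE, FALSE),
  row.names    = c("XX348", "XX234", "XX987")
)

# Each column keeps its own type
str(a)

# Select one case by row name
print(a["XX234", ])

# Keep only the cases with a tumour size above 7
large <- a[a$tumorsize > 7, ]
print(large)
```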

  13. Section 1: Loops
      When the same or similar tasks need to be performed multiple times: for all elements of a list, for all columns of an array, etc.
      for(i in 1:10) {
        print(i*i)
      }
      i=1
      while(i<=10) {
        print(i*i)
        i=i+sqrt(i)
      }

  14. Section 1: lapply, sapply, apply
      When the same or similar tasks need to be performed multiple times for all elements of a list or for all columns of an array. May be easier and faster than "for" loops.
      lapply( li, fct )
      To each element of the list li, the function fct is applied. The result is a list whose elements are the individual fct results.
      > li = list("klaus","martin","georg")
      > lapply(li, toupper)
      [[1]]
      [1] "KLAUS"
      [[2]]
      [1] "MARTIN"
      [[3]]
      [1] "GEORG"
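
The slide title also names sapply and apply, which are not demonstrated. A brief sketch of both follows: sapply simplifies the result to a vector where possible, and apply works over the rows (1) or columns (2) of a matrix. The small matrix m is made up for illustration.

```r
li <- list("klaus", "martin", "georg")

# sapply: like lapply, but simplifies the result to a vector if it can
upper_vec <- sapply(li, toupper)
name_len  <- sapply(li, nchar)

# A 2 x 3 matrix filled column by column: columns are (1,2), (3,4), (5,6)
m <- matrix(1:6, nrow = 2)

# apply: second argument 2 means "over columns", 1 means "over rows"
col_sums <- apply(m, 2, sum)
row_max  <- apply(m, 1, max)

print(upper_vec)
print(name_len)
print(col_sums)
print(row_max)
```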

  15. Section 1: Regular expressions
      A tool for text matching and replacement, available in similar forms in many programming languages (Perl, Unix shells, Java).
      > a = c("CENP-F","Ly-9", "MLN50", "ZNF191", "CLH-17")
      > grep("L", a)
      [1] 2 3 5
      > grep("L", a, value=T)
      [1] "Ly-9" "MLN50" "CLH-17"
      > grep("^L", a, value=T)
      [1] "Ly-9"
      > grep("[0-9]", a, value=T)
      [1] "Ly-9" "MLN50" "ZNF191" "CLH-17"
      > gsub("[0-9]", "X", a)
      [1] "CENP-F" "Ly-X" "MLNXX" "ZNFXXX" "CLH-XX"

  16. Section 1: Storing data
      Every R object can be stored into and restored from a file with the commands "save" and "load". This uses the XDR (external data representation) standard of Sun Microsystems and others, and is portable between MS-Windows, Unix, Mac.
      > save(x, file="x.Rdata")
      > load("x.Rdata")
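
A round trip makes the behaviour of save and load easier to see. The sketch below writes to a temporary directory (an assumption made here so that nothing is left in the working directory) and shows that load restores the object under its original name.

```r
# An example object to store
x <- c(a = 1, b = 2)

# Write it to a file in the session's temporary directory
path <- file.path(tempdir(), "x.Rdata")
save(x, file = path)

# Remove the object, then restore it from the file
rm(x)
restored_names <- load(path)  # load() returns the names of the restored objects

print(restored_names)
print(x)
```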

  17. Section 1: Importing and exporting data
      • There are many ways to get data into R and out of R.
      • Most programs (e.g. Excel), as well as humans, know how to deal with rectangular tables in the form of tab-delimited text files.
      > x = read.delim("filename.txt")
      also: read.table, read.csv
      > write.table(x, file="x.txt", sep="\t")
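
The two commands on the slide can be combined into a round trip. This is a minimal sketch with a made-up table (site names and richness values are hypothetical); row.names = FALSE and quote = FALSE are choices made here so that the file opens cleanly in a spreadsheet.

```r
# Hypothetical data: species richness at three sites
x <- data.frame(
  site     = c("A", "B", "C"),
  richness = c(12L, 7L, 19L),
  stringsAsFactors = FALSE
)

# Export as a tab-delimited text file in the temporary directory
path <- file.path(tempdir(), "x.txt")
write.table(x, file = path, sep = "\t", row.names = FALSE, quote = FALSE)

# Import it again; read.delim expects a header line and tab separators
y <- read.delim(path, stringsAsFactors = FALSE)
print(y)
```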

  18. Section 1: Statistical functions
      • t.test(): Student's t-test; tests whether the means of two populations differ.
      • prop.test(): hypothesis tests for proportions
      • Non-parametric tests:
        - kruskal.test(): Kruskal-Wallis rank-sum test (a non-parametric alternative to one-way analysis of variance)
        - chisq.test(): χ² test for count data (goodness of fit, independence)
        - ks.test(): Kolmogorov-Smirnov test
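
All of these functions return an object of class "htest", so their results can be inspected in the same way. The sketch below runs three of them on simulated data; the group means, sample sizes and the 2 x 2 table of counts are invented purely for illustration.

```r
set.seed(42)  # reproducible simulated data

# Two simulated samples with different true means
group_a <- rnorm(30, mean = 10, sd = 2)
group_b <- rnorm(30, mean = 12, sd = 2)

# Two-sample t-test (R uses the Welch variant by default)
tt <- t.test(group_a, group_b)

# Kruskal-Wallis test on the same two groups
kw <- kruskal.test(list(group_a, group_b))

# Chi-squared test of independence on a 2 x 2 table of counts
counts <- matrix(c(20, 15, 30, 35), nrow = 2)
cs <- chisq.test(counts)

# Every result stores its p-value in the same place
print(c(t = tt$p.value, kruskal = kw$p.value, chisq = cs$p.value))
```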

  19. Section 1: lm()
      > pisa.lm <- lm(inclin ~ year)
      > pisa.lm
      > summary(pisa.lm)
      Call:
      lm(formula = inclin ~ year)
      Residuals:
          Min      1Q  Median      3Q     Max
      -5.9670 -3.0989  0.6703  2.3077  7.3956
      Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
      (Intercept) -61.1209    25.1298  -2.432   0.0333 *
      year          9.3187     0.3099  30.069  6.5e-12 ***
      ---
      Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
      Residual standard error: 4.181 on 11 degrees of freedom
      Multiple R-Squared: 0.988, Adjusted R-squared: 0.9869
      F-statistic: 904.1 on 1 and 11 DF, p-value: 6.503e-012

  20. Section 1: Getting help
      Details about a specific command whose name you know (input arguments, options, algorithm, results):
      > ?t.test
      or
      > help(t.test)

  21. Section 1: Getting help
      • HTML search engine
      • search for topics with regular expressions: "help.search"
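
When the exact command name is not known, the search tools can be sketched as follows. apropos() matches function names against a regular expression, and help.search() searches the installed documentation; restricting the search to the stats package is an assumption made here only to keep the example quick.

```r
# Find objects whose names match a regular expression
name_matches <- apropos("^t\\.test")
print(name_matches)

# Search the help pages of one package for a topic
hits <- help.search("t-test", package = "stats")

# The result holds a table of matching help pages
print(nrow(hits$matches))
```

In an interactive session, help.start() opens the HTML help system in a browser.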

  22. Section 1: Web sites
      www.r-project.org
      cran.r-project.org
      www.bioconductor.org
      Full text search: www.r-project.org, or www.google.com with '… site:.r-project.org' or other R-specific keywords

  23. Section 2: R packages
      • There are many contributed packages that can be used to extend R.
      • These packages are created and maintained by their authors.

  24. Section 2: Installing R packages from CRAN
      • install.packages('packageName')
      • OR in Rgui:
        - select a local repository (if needed)
        - select package(s) from list
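
Installation is needed only once per machine, whereas a package has to be loaded in every session. A common pattern, sketched here under the assumption that the package of interest is vegan (any CRAN package name would do), is to check for the package before loading it:

```r
pkg <- "vegan"  # example package name; replace with the one you need

# requireNamespace() reports whether the package is installed
# without attaching it to the search path
installed <- requireNamespace(pkg, quietly = TRUE)

if (!installed) {
  # Installing needs network access, so it is only suggested here
  message("Package '", pkg, "' is not installed; run install.packages(\"",
          pkg, "\") to get it from CRAN.")
} else {
  # Attach the package for the current session
  library(pkg, character.only = TRUE)
}
```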

  25. Section 2: ade4: Analysis of Ecological Data: Exploratory and Euclidean methods in Environmental sciences
      http://pbil.univ-lyon1.fr/ADE4/home.php?lang=eng
      ade4 is characterized by:
      • the implementation of graphical and statistical functions
      • the availability of numerical data
      • the writing of technical and thematic documentation
      • the inclusion of bibliographic references

  26. Section 2: vegan: R functions for vegetation ecologists

  27. Vegan: R Labs for Vegetation Ecologists
      http://ecology.msu.montana.edu/labdsv/R/labs/
      • Introduction
        - R for Ecologists, a primer on the S language and available software
      • Familiarization with Data
        - Lab 1 Loading Vegetation Data and Simple Graphical Data Summaries
        - Lab 2 Loading Site/Environment Data and Simple Graphical Summaries
        - Lab 3 Vegetation Tables and Summaries
      • Modeling Species Distributions
        - Lab 4 Modeling Species Distributions with Generalized Linear Models
        - Lab 5 Modeling Species Distributions with Generalized Additive Models
        - Lab 6 Modeling Species Distributions with Classification Trees
      • Ordination
        - Lab 7 Principal Components Analysis
        - Lab 8 Principal Coordinates Analysis
        - Lab 9 Nonmetric Multi-Dimensional Scaling
        - Lab 10 Correspondence Analysis and Detrended Correspondence Analysis
        - Lab 11 Fuzzy Set Ordination
        - Lab 12 Canonical Correspondence Analysis
      • Cluster Analysis
        - Lab 13 Cluster Analysis
        - Lab 14 Discriminant Analysis with Tree Classifiers

  28. ape: Analyses of Phylogenetics and Evolution
      http://ape.mpl.ird.fr/
      • ape is not classical software (i.e., a "black box", though it can be used as one) but an environment for data analyses and development of new methods, in which interactivity and the user's decisions play an important role.
      • ape is written in R, but some functions for computer-intensive tasks are written in C.
      • ape is available for all main computer operating systems (Linux, Unix, Windows, MacOS X 10.4 and later).
      • ape is distributed under the terms of the GNU General Public Licence, meaning that it is free and can be freely modified and redistributed (under some conditions).

  29. smatr: (Standardised) Major Axis Estimation and Testing Routines
      This package provides methods for fitting bivariate lines in allometry using the major axis (MA) or standardised major axis (SMA), and for making inferences about such lines.

  30. BiodiversityR: GUI for biodiversity and community ecology analysis
      This package provides a GUI and some utility functions (often based on the vegan package) for statistical analysis of biodiversity and ecological communities, including:
      • species accumulation curves
      • diversity indices, Renyi profiles
      • GLMs for analysis of species abundance and presence-absence
      • distance matrices
      • Mantel tests, and cluster, constrained and unconstrained ordination analysis
      A book on biodiversity and community ecology analysis is available for free download from the website.
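
To show what a diversity index computes, the sketch below calculates species richness, the Shannon index H = -Σ p·ln(p) and the Simpson index 1 - Σ p² in base R. The abundance counts and the helper function names are hypothetical; this is not the code used inside BiodiversityR or vegan, which provide their own functions for these indices.

```r
# Shannon index: H = -sum(p * log(p)) over species with non-zero abundance
shannon_index <- function(counts) {
  p <- counts[counts > 0] / sum(counts)
  -sum(p * log(p))
}

# Simpson index: 1 - sum(p^2)
simpson_index <- function(counts) {
  p <- counts / sum(counts)
  1 - sum(p^2)
}

# Hypothetical abundance counts for five species at one site
counts <- c(sp1 = 25, sp2 = 10, sp3 = 5, sp4 = 40, sp5 = 20)

richness <- sum(counts > 0)
shannon  <- shannon_index(counts)
simpson  <- simpson_index(counts)

print(c(richness = richness, shannon = shannon, simpson = simpson))
```

A perfectly even community of S species gives the maximum Shannon value ln(S), which is a convenient check on the function.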

  31. untb: ecological drift under the UNTB
      A collection of utilities for biodiversity data. Includes the simulation of ecological drift under Hubbell's Unified Neutral Theory of Biodiversity, and the calculation of various diagnostics such as Preston curves.
      Functions (34), including:
      • alonso: Various functions from Alonso and McKane
      • bci: Barro Colorado Island (BCI) dataset
      • butterflies: abundance data for butterflies
      • etienne: Etienne's sampling formula
      • expected.abundance: Expected abundances under the neutral model
      • plot.count: Abundance curves
      • rand.neutral: Random neutral ecosystem
      • volkov: Expected frequency of species
      • zsm: Zero sum multinomial distribution as derived by McKane 2004

  32. sem: Structural Equation Models
      sem is an R package for fitting structural-equation models. The package supports general structural equation models with latent variables, fit by maximum likelihood assuming multinormality, and single-equation estimation for observed-variable models by two-stage least squares.

  33. R2WinBUGS: Running WinBUGS and OpenBUGS from R / S-PLUS
      The R2WinBUGS package provides convenient functions to call WinBUGS from R. It automatically writes the data and scripts in a format readable by WinBUGS for processing in batch mode, which is possible since version 1.4. After the WinBUGS process has finished, the resulting data can either be read into R by the package itself, which gives a compact graphical summary of inference and convergence diagnostics, or be analysed further with the facilities of the coda package.

  34. Others
      • bio.infer: Maximum Likelihood Method for Predicting Environmental Conditions from Assemblage Composition
      • popbio: Construction and analysis of matrix population models. Constructs and analyzes projection matrix models from a demography study of marked individuals classified by age or stage. The package covers methods described in Matrix Population Models by Caswell (2001) and Quantitative Conservation Biology by Morris and Doak (2002).
      • simecol: allows implementing ecological models (ODEs, IBMs, ...) using a template-like object-oriented structure. It helps to organize scenarios and may also be useful in other areas.
      • demogR: Analysis of age-structured demographic models

  35. Section 3: Making Elegant Graphics in R. Graphical capabilities
      • Simple, exploratory graphics are easy to produce.
      • Publication quality graphics can be created.
      • Several device drivers are available, including on-screen graphics and PostScript.
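
Writing a figure to a file follows the same pattern for every file device: open the device, draw, close the device. The sketch below uses pdf() rather than the postscript() device named on the slide, simply because PDF files are easier to view; the file name and figure size are arbitrary choices.

```r
# Output file in the session's temporary directory
out_file <- file.path(tempdir(), "example_plot.pdf")

# 1. Open a file device (size in inches)
pdf(out_file, width = 6, height = 4)

# 2. Draw as usual; output goes to the file instead of the screen
plot(sin(seq(0, 2 * pi, length.out = 100)), type = "l",
     xlab = "Index", ylab = "sin(x)")

# 3. Close the device so the file is written completely
invisible(dev.off())

print(file.exists(out_file))
```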

  36. Section 3: Types of graphics functions
      • High level functions: plot graphs from data
      • Low level functions: add to / modify existing graphs
      • Graphics parameters: control how plots are displayed

  37. Section 3: Some low and high level functions
      Low level:
      • points()
      • lines()
      • abline()
      • text()
      • mtext()
      • title()
      • legend()
      • etc.
      High level:
      • plot()
      • hist()
      • barplot()
      • boxplot()
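
The division of labour between the two groups can be sketched as follows: one high-level call creates the plot, and low-level calls then add to it. The data are simulated and the labels are invented for illustration.

```r
set.seed(1)
x <- 1:20
y <- 2 * x + rnorm(20, sd = 4)  # simulated noisy linear data

# High-level function: starts a new plot with axes and a frame
plot(x, y, xlab = "x", ylab = "y")

# Low-level functions: add to the existing plot
title(main = "High-level plot with low-level additions")
lines(x, 2 * x, col = "blue")                     # the true line y = 2x
top <- which.max(y)
points(x[top], y[top], pch = 19, col = "red")     # highlight the largest y
text(x[top], y[top], labels = "max", pos = 2, col = "red")
legend("topleft", legend = c("y = 2x", "maximum"),
       col = c("blue", "red"), lty = c(1, NA), pch = c(NA, 19))
```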

  38. Section 3: Graphical parameters
      • list of parameters held in par()
      • controls how plots are displayed
      • values in par() can be overridden by calls to high-level functions
        - values are changed temporarily for that plot
        - no permanent change to par() values
      • can be modified directly
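
Both ways of setting parameters are sketched below. Passing col to plot() affects only that plot, whereas calling par() changes the setting until it is reset; par() returns the previous values, which makes restoring them straightforward. The data are simulated.

```r
set.seed(1)
x <- rnorm(100)

# Temporary override: col applies to this plot only
col_before <- par("col")
plot(x, col = "red", pch = 19)
col_after <- par("col")          # unchanged by the call above

# Direct modification: two plots side by side until reset
old_par <- par(mfrow = c(1, 2))  # par() returns the old settings
hist(x)
boxplot(x)
par(old_par)                     # restore the previous layout

print(par("mfrow"))
```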

  39. Section 3: Plotting single variables: histogram
      • First generate data: x<-rnorm(100)
      • Plot histogram: hist(x)
      rnorm takes n random samples from a normal distribution. Mean and sd can be specified; the defaults are (0,1).
      hist is a generic function to plot histograms of a dataset.

  40. Section 3: Plotting single variables: boxplot
      Generate data: x1<-rnorm(100)
      boxplot function: boxplot(x1)
      Add title: title(main="Boxplot of x1")

  41. Section 3: Plotting single variables: boxplot
      • Generate more data:
        x2<-rnorm(100)
        x3<-rnorm(100,2)
      boxplot(x1,x2,x3)
      title(main="three boxplots")
      OR
      bp3<-data.frame(x1,x2,x3)
      boxplot(bp3)

  42. Section 3: Plotting single variables: barplot
      tN <- table(Ni <- rpois(100, lambda=5))
      barplot(tN)
      barplot(tN,col=rainbow(20))

  43. Section 3: Barplots for grouped data
      Using the VADeaths table from the datasets package:
      > library(datasets); data(VADeaths); VADeaths
            Rural Male Rural Female Urban Male Urban Female
      50-54       11.7          8.7       15.4          8.4
      55-59       18.1         11.7       24.3         13.6
      60-64       26.9         20.3       37.0         19.3
      65-69       41.0         30.9       54.6         35.1
      70-74       66.0         54.3       71.1         50.0

  44. Section 3: Barplots for grouped data
      barplot(VADeaths, col=rainbow(5), legend.text=TRUE)               # stacked bars
      barplot(VADeaths, col=rainbow(5), legend.text=TRUE, beside=TRUE)  # bars side by side

  45. Section 3: Plotting single variables: scatterplot
      function (n=20) {
        x1<-rnorm(n)
        par(mfrow=c(3,2))
        plot(x1,type="p",main="type p: points")
        plot(x1,type="l",main="type l: lines")
        plot(x1,type="b",main="type b: both")
        plot(x1,type="o",main="type o: overplot")
        plot(x1,type="h",main="type h: histogram")
        plot(x1,type="s",main="type s: steps")
        par(mfrow=c(1,1))
      }

  46. Section 3: Plotting two variables: scatterplot
      • Example data: "women" from the datasets package. Heights vs weights for American women.
      To get data into workspace:
      > library(datasets); data(women); attach(women)
      Three equivalent ways to draw the plot:
      plot(x=height,y=weight)
      plot(weight~height)
      plot(women$height,women$weight)

  47. Section 3: Adding to plots: lines
      • Straight lines are added using the "abline" function:
        abline(v=x) draws a vertical line at value x
        abline(h=y) draws a horizontal line at value y
        abline(a=x,b=m) draws a line with intercept x and gradient m
      • lines() adds a series of connected lines
        - takes vectors of x and y co-ordinates and draws lines between them
      • segments() draws (disjoint) line segments between pairs of points
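
The three line-drawing functions can be compared on an empty plot. In this sketch all coordinates are arbitrary values chosen for illustration.

```r
# Empty plotting region from 0 to 10 on both axes
plot(0:10, 0:10, type = "n", xlab = "x", ylab = "y")

# abline(): infinite straight lines
abline(v = 5, col = "grey")          # vertical line at x = 5
abline(h = 2, col = "grey")          # horizontal line at y = 2
abline(a = 1, b = 0.5, col = "red")  # intercept 1, gradient 0.5

# lines(): connects the points (1,8), (3,4), (6,7), (9,3) in order
xs <- c(1, 3, 6, 9)
ys <- c(8, 4, 7, 3)
lines(xs, ys, col = "blue")

# segments(): separate pieces, each from (x0, y0) to (x1, y1)
segments(x0 = c(1, 6), y0 = c(1, 1),
         x1 = c(3, 9), y1 = c(2, 1.5), lwd = 2)
```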

  48. Section 3: Adding to plots: lines
      • "cars" from the datasets package has speed and stopping distances
      • draw scatter plot and add regression line
      library(datasets)
      data(cars)
      attach(cars)
      plot(dist~speed)
      cars.lm<-lm(dist~speed)
      abline(cars.lm, col="red",lty=2)

  49. Section 3: Adding to plots: lines and text
      function (df=1,p=0.05) {
        x<-seq(1,50,0.01)
        y<-dchisq(x,df)
        plot(x,y,type="l",xlim=c(0,10),ann=F)
        title(xlab=substitute(paste(chi^2,"statistic")),ylab="Frequency",
              main=(substitute(paste(chi^2,"distribution with",df,"df, p=",p))))
        critx<-qchisq(p,df,lower.tail=F)
        abline(v=critx,col="red")
        xr<-c(critx,x[x>critx])
        yr<-c(dchisq(critx,df),y[x>critx])
        polygon(x=c(critx,critx,xr,10), y=c(0,dchisq(critx,df),yr,0),col="red")
        xr<-round(critx,0)+2
        yr<-max(y)/4
        rx<-round(critx,3)
        text(xr,yr,substitute(paste("Critical ",chi^2,"=",rx)),col="red")
      }

  50. Section 3: Adding to plots: margin text
      • mtext() adds text in the margins of figures
      • Margins are numbered 1:4, clockwise from the bottom
      • Text is printed a specified number of lines away from the edge of the plot area
      • Can print in outer margins if available
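
The margin numbering is easiest to see by labelling each side. The sketch below writes one label per margin and one in the outer margin; the oma setting, the line offsets and the label texts are choices made here for illustration.

```r
# Reserve two lines of outer margin at the top; keep the old settings
old_par <- par(oma = c(0, 0, 2, 0))

# A plot without axis titles, so the margins are free for mtext()
plot(1:10, ann = FALSE)

# Sides are numbered 1 = bottom, 2 = left, 3 = top, 4 = right
side_names   <- c("bottom", "left", "top", "right")
line_offsets <- c(2.5, 2.5, 0.5, 0.5)  # distance from the plot edge, in lines
for (s in 1:4) {
  mtext(paste("side", s, "=", side_names[s]), side = s, line = line_offsets[s])
}

# outer = TRUE places the text in the outer margin instead
mtext("Text in the outer margin", side = 3, line = 0.5, outer = TRUE)

par(old_par)  # restore the previous settings
```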
