R and Modern Statistical Computing
Robert Gentleman

Outline
  • Introduction
  • R past
  • R present
  • R future
  • Bioconductor
What is R?
  • R is an environment for data analysis and visualization
  • R is an open source implementation of the S language
  • S-Plus is a commercial implementation of the S language
  • The current version of R is 1.4.1
  • www.r-project.org
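The flavour of R as an environment for data analysis can be shown in a few lines of an interactive session (the values here are arbitrary illustration, not from the talk):

```r
# A minimal R session: summary statistics on a small vector
x <- c(4.1, 5.3, 2.8, 6.0, 3.9)
mean(x)       # arithmetic mean of the five values
summary(x)    # five-number summary plus the mean
# plot(x, type = "b")  # visualization, commented out for non-interactive use
```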
R Core
  • Doug Bates, John Chambers, Peter Dalgaard, Robert Gentleman, Kurt Hornik, Stefano Iacus, Ross Ihaka, Friedrich Leisch, Thomas Lumley, Martin Maechler, Guido Masarotto, Paul Murrell, Brian Ripley, Duncan Temple Lang, and Luke Tierney
  • Duncan Murdoch, Martyn Plummer, Vincent Carey
Funding for R
  • to date R has had little funding (no formal funding)
  • our universities, particularly the University of Auckland have provided support
  • the Dept of Biostatistics, Harvard has donated $5,000.00
R History
  • 1991: Ross Ihaka and Robert Gentleman begin work on a project that will ultimately become R
  • 1992: Design and implementation of pre-R
  • 1993: The first announcement of R
  • 1995: R available by ftp under the GPL
R History
  • 1996: A mailing list is started and maintained by Martin Maechler at ETH
  • 1997: The R core group is formed
  • 1999: DSC meeting in Vienna, the first time many of R core meet
  • 2000: R 1.0.0 is released
  • 2002: R 1.4.1 is the current release
Open Source
  • R is both open source and open development
    • you can look at the source code and you can propose changes that we will generally adopt
  • R is not in the public domain
  • You are given a license to run our software
    • GPL (current)
    • LGPL (under consideration)
R and Omegahat
  • Omegahat: www.omegahat.org
  • Omegahat is another initiative that will allow us to explore alternative implementations and languages without disturbing the R user base too much
  • Current contents are largely the work of Duncan Temple Lang and John Chambers
R Design
  • Many of the features of S but with slightly different semantics and memory management.
  • We chose Scheme for our semantic model.
  • Much of the original code has since been replaced but the basic model remains intact.
R Internals
  • R is written mainly in C
  • Our original intention was for R to be as platform independent as practical.
  • We began with Macintosh as a primary delivery platform and Unix as our primary development platform.
What Platforms?
  • Unix of many flavours including Linux, Solaris, FreeBSD, AIX (compiles on 64 bit machines)
  • Windows - 95/98/NT and 2000
  • both binaries and source available
  • R can be obtained from
    • www.r-project.org
R Internals
  • One difference with S is scope
    • R uses a different set of rules to bind variables to values
  • In S it is hard to treat programs as data
  • R should be source code compatible with S-Plus for most code that you will write
  • an environment is a mechanism for binding symbols to values (hence similar to a hash table)
  • each environment has a parent environment
  • a big difference between R and S is that R has lexical scope
  • a function has an environment associated with it, and that environment provides bindings for any free variables in the function
  • another way that this can be thought of is that in R functions have mutable state
  • Ihaka and Gentleman (JCGS, 2000)
  • environments are also associated with formulas in R
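Lexical scope and the mutable state it gives functions can be illustrated with a small closure (a standard example, not taken from the talk itself):

```r
# Lexical scope: a function's environment binds its free variables,
# so R functions can carry mutable state between calls.
make_counter <- function() {
  count <- 0                  # free variable, bound in this environment
  function() {
    count <<- count + 1       # <<- rebinds count in the enclosing environment
    count
  }
}

counter <- make_counter()
counter()   # 1
counter()   # 2  -- the state persists in the function's environment
```

In S3-era S, with its dynamic lookup in the global frame, `count` would not be found this way; in R the function created inside `make_counter` remembers the environment it was created in.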
How did we do it?
  • we took advantage of certain technologies
  • CVS – for version control
  • a reasonably sophisticated checking system
  • every example in R is runnable and is run many times by all users
  • any changes made must pass the checking routine before they are committed
  • this very simple idea makes distributed development possible
  • I am responsible for writing examples for my code (and I should be because I know it)
  • others are responsible for making sure that they do not break my code (by running my examples)
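The checking idea can be sketched in miniature: a documented example that runs, and whose assertions would fail loudly if a later change broke the function (`col_means` is a made-up illustration, not a function from R):

```r
# The idea behind the checking system: every documented example is runnable,
# so running the examples doubles as regression testing.
col_means <- function(m) apply(m, 2, mean)

# Example, as it would appear in the function's documentation:
m <- matrix(1:6, nrow = 2)     # columns: (1,2), (3,4), (5,6)
col_means(m)                   # 1.5 3.5 5.5
stopifnot(identical(col_means(m), c(1.5, 3.5, 5.5)))
```

If someone else's change alters what `col_means` returns, this example stops running, and the change cannot be committed.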
R Package System
  • packages are self-contained units of code with documentation
  • there are automatic testing features built in
  • all functions must have examples and the examples must run
  • interesting commands:
    • example, update.packages
Databases
  • R will talk to most databases
  • the ability to access large tables, execute SQL queries etc
  • RPgSQL has the notion of proxy objects
    • R symbols refer to tables in the database
    • these can look like data.frames in R
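The proxy-object idea can be sketched without a real database. In this hypothetical illustration a plain data.frame stands in for a PostgreSQL table, and an S3 subset method plays the role of the SQL query that RPgSQL would issue; the names `backend`, `make_proxy`, and `db_proxy` are invented for the sketch:

```r
# Sketch of the proxy-object idea: the proxy records where the data live,
# and subsetting it triggers the "query" against the backing table.
backend <- data.frame(gene = c("a", "b", "c"), expr = c(1.2, 0.7, 2.5))

make_proxy <- function(table_name) {
  structure(list(table = table_name), class = "db_proxy")
}

# Method dispatch makes the proxy look like a data.frame:
"[.db_proxy" <- function(x, i, j) {
  # in RPgSQL this would issue an SQL SELECT against the database table
  get(x$table)[i, j]
}

tab <- make_proxy("backend")
tab[2, "expr"]   # fetches 0.7 from the backing table on demand
```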
Object Oriented Programming
  • S3 class system is a good start but it has some major deficiencies
  • in Programming with Data, John Chambers introduced a new and potentially much better system
  • object oriented programming helps us build better programs and deal more naturally with complex data structures
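S3 dispatch can be shown in a few lines; the `gene` class here is an invented illustration:

```r
# S3 dispatch: a generic calls the method matching the object's class attribute.
gene <- structure(list(name = "TP53", expr = 3.2), class = "gene")

print.gene <- function(x, ...) {
  cat("Gene", x$name, "expression:", x$expr, "\n")
}

print(gene)   # dispatches to print.gene because class(gene) is "gene"
```

The deficiencies the slide alludes to follow from this scheme: the class is just an attribute, nothing validates the object's structure, and inheritance is by naming convention only.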
Object Oriented Programming
  • a formal mechanism for defining classes of objects
  • these provide us with an abstraction that lets us deal with complex data
  • generic functions and methods also reduce complexity (for the user)
    • plot is a generic, methods are defined to implement plot for different types of data
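The formal class system from Programming with Data (the methods package) can be sketched as follows; `Sample` and `plotSummary` are invented names, and the `slots =` syntax is that of current R rather than the 2002-era release:

```r
library(methods)  # formal (S4) classes and generic functions

# A formal class definition: slots are declared with types
setClass("Sample", slots = c(id = "character", values = "numeric"))

# A generic function, plus a method implementing it for Sample objects
setGeneric("plotSummary", function(object) standardGeneric("plotSummary"))
setMethod("plotSummary", "Sample", function(object) {
  # a real method would call plot(); here we just return a summary statistic
  mean(object@values)
})

s <- new("Sample", id = "s1", values = c(1, 2, 3))
plotSummary(s)   # 2
```

Unlike S3, the class declares its structure up front, so `new()` can validate objects and methods can rely on the slots being present.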
Object Oriented Programming
  • more important for developers than for users
  • it may not be worth defining classes and methods interactively
  • Vincent Carey has been working on better mechanisms for documenting the classes and methods
R as a broker
  • R can execute code in virtually any other language
  • R has connections, which can be used to access data via different protocols
  • R is embeddable in other languages
    • systems like Perl, Python, Postgres, Apache
    • allow the user to define and use procedural languages
R as a broker
  • this means that we can push the calculations to more natural places
  • computation can be done where the data are rather than by transporting data
  • this will greatly increase our ability to process large data sets
R: Future
  • where to next?
  • XML and markup languages
  • compilation
  • object oriented programming
XML
  • eXtensible Markup Language
  • has many friends, XSLT, XLINK, …
  • similar to HTML, but more flexible
  • <foo> hi there </foo>
  • I define my own tags, and provide information about their meaning
  • it allows us to provide semantics/meaning to data
  • it separates content from presentation
  • content can be presented in many different ways (SAS – output)
  • we can use a single parser written by an expert
  • data can be read and understood directly from the source
  • eg: we want to search PubMed abstracts
  • these are contained in web pages at NCBI
  • using the XML package and htmlTreeParse this is a simple operation from within R
  • will form the basis of a more flexible documentation format
  • documentation is really content; how you view the help page is rendering (as HTML, internal R format, etc.)
  • the ability to selectively run examples with lots of control
  • live documents
  • reports etc can be made into live documents using XML (or similar strategies)
  • see Sweave (Leisch, 2002) in R 1.5.0 or from Fritz’s web site
  • documents can automatically update (daily/weekly etc)
Compilation
  • most users are interested in compilation because they believe it will increase speed
  • we are interested in it for a variety of reasons
    • understanding how to compile helps us understand how the language functions (where the warts are)
    • virtual machines: JVM, .Net
  • we need to develop a new syllabus for statistical computing courses
  • tools that are needed include
    • computational inference
    • database interactions
    • software design and structure
    • markup languages (and relatives)
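The XML idea above, user-defined tags carrying meaning along with the data, can be sketched minimally. Real work would use the XML package's htmlTreeParse as described on the slide; this hand-rolled extraction with base R regular expressions (the `doc` string and `extract_tag` helper are invented for illustration) only shows the principle:

```r
# User-defined tags give semantics to data; extracting by tag name
# recovers the content directly from the source.
doc <- "<abstract> gene expression in tumours </abstract><year>2002</year>"

extract_tag <- function(doc, tag) {
  pattern <- sprintf("<%s>([^<]*)</%s>", tag, tag)
  m <- regmatches(doc, regexec(pattern, doc))[[1]]
  trimws(m[2])    # m[2] is the captured group: the content between the tags
}

extract_tag(doc, "abstract")   # "gene expression in tumours"
extract_tag(doc, "year")       # "2002"
```

Because content is separate from presentation, the same tagged document could be rendered as HTML, searched programmatically, or fed into an analysis, all from a single parse.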
The Future
  • statistical computing can develop into a rich subject if it is encouraged
  • encouragement needs to take several different approaches
  • support: financial and career development
  • statistical computing is a laboratory science, it needs to be funded and run that way
Production of Code
  • we need to encourage (very strongly) writers of methodology to provide code that implements their methodology
  • the mathematical or theoretical description of a data analytic technique is really worth very little
  • if that technique is implemented then it is much more useful
Production of Code
  • the R package system is a reasonable delivery mechanism
  • some design principles will be needed
An Example
  • Bioconductor is a new software initiative
    • www.bioconductor.org
  • among the goals of this project is the deployment of high quality software for the analysis of genomic data
  • the challenges are varied and exciting
Genomic Data
  • the data are large; tens of thousands of genes across a few hundred samples
  • the biologists have developed high throughput methods for screening samples
  • we need to develop high throughput methods for analysis
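A flavour of what "high throughput analysis" means in practice: the same summary computed for every gene at once, here on simulated data (the matrix and its dimensions are invented for illustration, not real expression values):

```r
# High-throughput analysis sketch: one statistic per gene, computed
# across all genes in a single vectorised step (simulated data).
set.seed(1)
n_genes   <- 1000
n_samples <- 20
exprs <- matrix(rnorm(n_genes * n_samples), nrow = n_genes,
                dimnames = list(paste0("gene", 1:n_genes), NULL))

gene_means <- apply(exprs, 1, mean)   # one summary per gene, all at once
length(gene_means)                    # 1000
```

The same pattern scales the analysis with the screening technology: swapping `mean` for a test statistic screens tens of thousands of genes in one call.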
Genomic Data
  • other challenges: much of the data is non-numeric
  • the annotation of genes, their location on the chromosome, deletions, mutations
  • the role of the gene in a particular pathway
  • what do we measure?
    • DNA (the raw thing)
    • mRNA (microarrays – transcribed DNA)
    • protein (proteomics – translated DNA)
  • these data gain value from annotation, from knowledge about adjacent genes or gene products
  • data sources are varied with different formats, error structures etc
TGF-b pathway
  • TGF-b (transforming growth factor beta) plays an essential role in the control of development and morphogenesis in multicellular organisms.
  • This is done through SMADS, a family of signal transducers and transcriptional activators.
  • http://www.grt.kyushu-u.ac.jp/spad/
  • There are many open questions regarding the relationship between expression level and pathways.
  • It is not clear whether expression level data will be informative.
Acknowledgements
  • Ross Ihaka, without whom there would be no R
  • John Chambers, for S and gracious guidance
  • Luke Tierney, Vince Carey, Duncan Temple Lang
  • Dept of Stats, U of Auckland