R: An Open Source Statistical Environment

Valentin Todorov

UNIDO

MSIS 2008 (Luxembourg, 7-9 April 2008)

MSIS 2008, Luxembourg: Valentin Todorov

Outline
  • Introduction: the R Platform and Availability
  • R Learning Curve (is R hard to learn)
  • R Extensibility (R Packages)
  • R and the others (Interfaces)
  • R Graphics
  • R for Time series
  • R for Survey Analysis
  • R and the Outliers (Robust Statistics in R)
  • More R features (WEB, Missing data, OOP, GUI)
  • Summary and Conclusions

What is R
  • R is “a system for statistical computation and graphics. It provides, among other things, a programming language, high-level graphics, interfaces to other languages and debugging facilities”
  • Developed after the S language and environment
    • S was developed at Bell Labs (John Chambers et al.)
    • S-Plus: a value-added implementation of the S language (Insightful Corporation)
    • much code written for S runs unaltered under R
  • Significantly influenced by Scheme, a Lisp dialect

What is R
  • Ihaka and Gentleman, University of Auckland (New Zealand)
    • 1993: a preliminary version of R
    • 1995: released under the GNU General Public License
    • Now: R-core team consisting of 17 members including John Chambers
  • R provides a wide variety of statistical (linear and non-linear modelling, classical statistical tests, time-series analysis, classification, clustering, robust methods and many more) and graphical techniques
  • R is available as Free Software under the terms of the GNU General Public License (GPL).

R Extensibility (R Packages)
  • One of the most important features of R is its extensibility by creating packages of functions and data.
  • The R package system provides a framework for developing, documenting, and testing extension code.
  • Packages can include R code, documentation, data and foreign code written in C or Fortran.
  • Packages are distributed through the CRAN repository – http://cran.r-project.org - currently more than 1300 packages covering a wide variety of statistical methods and algorithms. ‘base’ and ‘recommended’ packages are included in all binary distributions.
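
A minimal sketch of how the package system looks from the R prompt. It lists the 'base' and 'recommended' packages of the local installation with installed.packages(); the package name in the commented install step is only an example and needs network access.

```r
# Packages shipped with this R installation, grouped by priority.
base_pkgs <- rownames(installed.packages(priority = "base"))
recommended_pkgs <- rownames(installed.packages(priority = "recommended"))

print(base_pkgs)
print(recommended_pkgs)

# Installing and loading an add-on package from CRAN.
# Left commented out because it needs network access;
# "robustbase" is just an example name.
# install.packages("robustbase")
# library(robustbase)
```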

R and the Others (R Interfaces)
  • Reading and writing data (text files, XML, spreadsheet-like data, e.g. Excel)
  • Read and write data formats of SAS, S-Plus, SPSS, STATA, Systat, Octave – package foreign.
  • Emulation of Matlab – package matlab.
  • Communication with RDBMS – ROracle, RMySQL, RSQLite, RmSQL, RPgSQL, RODBC – large data sets, concurrency
  • Package filehash – a simple key-value style database, the data are stored on disk but are handled like data sets
  • Can use compiled native code in C, C++, Fortran, Java
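
As an illustration of the simplest case, reading and writing text files, here is a small round trip through a CSV file using only base R. The data frame and the file name in the commented foreign call are made up for the example.

```r
# A small, made-up data set.
df <- data.frame(id = 1:3,
                 group = c("a", "b", "c"),
                 value = c(1.5, 2.5, 3.5))

# Write it to a temporary CSV file and read it back.
csv_file <- tempfile(fileext = ".csv")
write.csv(df, csv_file, row.names = FALSE)
df_back <- read.csv(csv_file, stringsAsFactors = FALSE)
str(df_back)

# Files from other statistical systems are read in a similar way with
# package foreign, e.g. (hypothetical file name):
# library(foreign)
# spss_data <- read.spss("survey.sav", to.data.frame = TRUE)
```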

R Graphics
  • One of the most important strengths of R – simple exploratory graphics as well as well-designed publication quality plots.
  • The graphics can include mathematical symbols and formulae where needed.
  • Can produce graphics in many formats:
    • On screen
    • PS and PDF for including in LaTeX and pdfLaTeX or for distribution
    • PNG or JPEG for the Web
    • On Windows, metafiles for Word, PowerPoint, etc.
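
A short sketch of the points above: a plot with mathematical annotation written to a PDF file. The curve and the output file are chosen for illustration only; other devices such as png() are opened and closed in the same way.

```r
# Open a PDF graphics device; everything plotted until dev.off() goes to the file.
pdf_file <- tempfile(fileext = ".pdf")
pdf(pdf_file, width = 6, height = 4)

# Standard normal density with a title built from plotmath expressions.
curve(dnorm(x), from = -4, to = 4, ylab = "density",
      main = expression(paste("Normal density, ", mu == 0, ", ", sigma == 1)))

# Add the formula of the density as an annotation inside the plot.
text(2.5, 0.3, expression(f(x) == frac(1, sqrt(2 * pi)) * e^{-x^2 / 2}))

dev.off()

# For the Web, a PNG device could be used instead, for example:
# png("density.png", width = 600, height = 400)
```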

R Graphics: basic and multipanel plots (trellis)
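
A minimal sketch of a basic plot and a multipanel (trellis) plot. It assumes the recommended package lattice and the built-in mtcars data; neither is named on the slide.

```r
library(lattice)

plot_file <- tempfile(fileext = ".pdf")
pdf(plot_file)

# Basic scatter plot with base graphics.
plot(mpg ~ wt, data = mtcars,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon",
     main = "Basic scatter plot")

# Trellis plot: the same relationship, one panel per number of cylinders.
trellis_plot <- xyplot(mpg ~ wt | factor(cyl), data = mtcars,
                       layout = c(3, 1),
                       xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")
print(trellis_plot)  # lattice objects are drawn when printed

dev.off()
```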

R Graphics: parallel plot and coplot
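
A possible sketch of both plot types, assuming coplot() from base graphics, parallelplot() from the recommended package lattice, and the built-in quakes and iris data (the data sets are not the ones on the slide).

```r
library(lattice)

cond_file <- tempfile(fileext = ".pdf")
pdf(cond_file)

# Conditioning plot: earthquake locations, conditioned on depth.
coplot(lat ~ long | depth, data = quakes)

# Parallel coordinate plot of the four iris measurements, one panel per species.
par_plot <- parallelplot(~ iris[1:4] | Species, data = iris)
print(par_plot)

dev.off()
```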

R for Time Series
  • Package stats
    • classical time series modeling tools – arima() for Box-Jenkins type analysis
    • structural time series – StructTS()
    • filtering and decomposition – decompose() and HoltWinters()
  • Package forecast – additional forecast methods and graphical tools
  • Analyzing monthly or lower frequency time series:
    • TRAMO/SEATS
    • X-12-ARIMA
      • accessible through the Gretl library
  • Task View Econometrics: http://cran.r-project.org/web/views/Econometrics.html
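
The stats functions named above can be tried on data sets that ship with R. The sketch below uses the built-in co2 and UKgas series, which are chosen here only for illustration.

```r
# Classical decomposition into trend, seasonal and random components.
dec <- decompose(co2)

# Holt-Winters exponential smoothing and a 12-month forecast.
hw <- HoltWinters(co2)
hw_forecast <- predict(hw, n.ahead = 12)
print(hw_forecast)

# Basic structural model (level, slope and seasonal components).
sts <- StructTS(log10(UKgas), type = "BSM")
print(sts)
```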

R for Time Series: Example
  • Fitting an ARIMA model to a univariate time series with arima() and using tsdiag() to plot time series diagnostics
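
A minimal sketch of such a fit, assuming the built-in USAccDeaths series and a seasonal "airline" model. The slide does not say which data or model orders were used, so both are illustrative choices.

```r
# Seasonal ARIMA(0,1,1)(0,1,1)[12] fitted to monthly accidental deaths in the USA.
fit <- arima(USAccDeaths, order = c(0, 1, 1),
             seasonal = list(order = c(0, 1, 1)))
print(fit)

# Diagnostic plots: standardized residuals, residual ACF, Ljung-Box p-values.
diag_file <- tempfile(fileext = ".pdf")
pdf(diag_file)
tsdiag(fit)
dev.off()

# Forecast the next 12 months with standard errors.
fc <- predict(fit, n.ahead = 12)
print(fc$pred)
```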

R for Survey Analysis
  • Complex survey samples are usually analysed by specialized software packages: SUDAAN, Bascula 4 (Statistics Netherlands), etc.
  • STATA provides much more comprehensive support for analysing survey data than SAS and SPSS and could successfully compete with the specialized packages

R for Survey Analysis
  • R – package survey - http://faculty.washington.edu/tlumley/survey/
    • stratification, clustering, possibly multistage sampling, unequal sampling probabilities or weights; multistage stratified random sampling with or without replacement
    • Summary statistics: means, totals, ratios, quantiles, contingency tables, regression models, for the whole sample and for domains
    • Variances by Taylor linearization or by replicate weights (BRR, jack-knife, bootstrap, or user-supplied)
    • Graphics: histograms, hexbin scatterplots, smoothers
  • Other packages: pps, sampling, sampfling
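
A sketch of a typical workflow with package survey, using the api example data that ships with the package. It assumes survey is installed; the code is guarded so that it is skipped otherwise. The design and variables are the package's own example, not data from the presentation.

```r
if (requireNamespace("survey", quietly = TRUE)) {
  library(survey)
  data(api)  # example data sets, including the stratified sample 'apistrat'

  # Stratified design: strata by school type, sampling weights and
  # finite population correction taken from the data.
  dstrat <- svydesign(id = ~1, strata = ~stype, weights = ~pw,
                      fpc = ~fpc, data = apistrat)

  # Summary statistics with Taylor linearization standard errors.
  api_mean <- svymean(~api00, dstrat)
  print(api_mean)
  print(svytotal(~enroll, dstrat))

  # Domain estimates: mean by school type.
  print(svyby(~api00, ~stype, dstrat, svymean))

  # The same mean with replicate weights; the replication type is
  # chosen automatically from the design.
  drep <- as.svrepdesign(dstrat)
  print(svymean(~api00, drep))
} else {
  message("Package 'survey' is not installed; skipping the example.")
}
```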

R and the Outliers (Robust Statistics in R)
  • What are Outliers
    • atypical observations which are inconsistent with the rest of the data or deviate from the postulated model
    • may arise through contamination, errors in data gathering, or misspecification of the model.
    • classical statistical methods are very sensitive to such data
  • What are Robust methods
    • Produce reasonable results even when one or more outliers may appear in the data
    • Robust regression - robustbase
    • Robust multivariate methods – rrcov, robustbase
    • Robust time series analysis - robust-ts

R and the Outliers: Example
  • Example: Wages and Hours - http://lib.stat.cmu.edu/DASL/
    • a national sample of 6000 households with a male head earning less than $15,000 annually in 1966 - 9 independent variables; classified into 39 demographic groups
    • estimate y = the labour supply (average hours) from the available data (for the example we will consider only one variable: x = average age of the respondents)
    • We will fit an Ordinary Least Squares (OLS) model and a robust Least Trimmed Squares (LTS) model
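
The comparison can be sketched as follows. Note the assumptions: the data are synthetic (the Wages and Hours data are not reproduced here), and LTS is fitted with lqs() from the recommended package MASS, which may differ from the function used in the presentation.

```r
library(MASS)

# Synthetic data: a clean linear relation y = 10 + 2x plus small noise,
# with the last five observations replaced by gross outliers.
set.seed(1)
x <- 1:30
y <- 10 + 2 * x + rnorm(30, sd = 0.5)
y[26:30] <- 5

# Ordinary least squares: every observation has full influence.
ols <- lm(y ~ x)

# Least trimmed squares: minimizes the sum of the smallest squared residuals,
# so the outliers are effectively ignored.
lts <- lqs(y ~ x, method = "lts")

print(coef(ols))
print(coef(lts))
```

In this setup the OLS slope is pulled far away from the true value of 2 by the five outliers, while the LTS slope stays close to it.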

R and the Outliers: Example OLS

R and the Outliers: Example LTS

R and the Outliers: Example Covariance
  • Maronna & Yohai (1998)
  • rrcov: data set maryo
  • A bivariate data set with:
    • sample correlation: 0.81
    • after interchanging the largest and smallest values in the first coordinate, the sample correlation becomes 0.05
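
The effect can be reproduced in a few lines. This is only a sketch on made-up data (not the maryo data set, so the numbers differ from those above), and the robust estimate uses cov.rob() from the recommended package MASS as a stand-in for the estimators in rrcov.

```r
library(MASS)

# Made-up, strongly correlated bivariate data.
x <- 1:20
y <- x + sin(x)
r_orig <- cor(x, y)

# Interchange the largest and the smallest value of the first coordinate.
i_min <- which.min(x)
i_max <- which.max(x)
x_swapped <- x
x_swapped[c(i_min, i_max)] <- x[c(i_max, i_min)]
r_swapped <- cor(x_swapped, y)

# Robust correlation (minimum covariance determinant) on the modified data.
set.seed(1)
rob <- cov.rob(cbind(x_swapped, y), cor = TRUE, method = "mcd")
r_robust <- rob$cor[1, 2]

print(c(original = r_orig, swapped = r_swapped, robust = r_robust))
```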

More R…
  • R and the WEB - several projects that provide possibilities to use R over the WEB
  • R and the Missing – advanced missing value handling
    • mvnmle: ML estimation for multivariate data with missing values
    • mitools: Tools for multiple imputation of missing data
    • mice - Multivariate Imputation by Chained Equations
    • EMV: Estimation of Missing Values for a Data Matrix
    • VIM: provides methods for the visualisation as well as imputation of missing data
  • R Objects – R is an Object Oriented language (however in a quite different sense from C++, Java, C#)
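
To illustrate the last point: in R's S3 system, methods belong to generic functions rather than to classes, and dispatch happens on the class attribute of the argument. A small sketch with made-up classes and a made-up generic area():

```r
# Constructors: an S3 object is an ordinary list with a class attribute.
circle <- function(r) structure(list(r = r), class = c("circle", "shape"))
rectangle <- function(w, h) structure(list(w = w, h = h),
                                      class = c("rectangle", "shape"))

# A generic function and one method per class.
area <- function(shape, ...) UseMethod("area")
area.circle <- function(shape, ...) pi * shape$r^2
area.rectangle <- function(shape, ...) shape$w * shape$h

# A print method shared by all shapes via the common parent class "shape".
print.shape <- function(x, ...) {
  cat(class(x)[1], "with area", format(area(x)), "\n")
  invisible(x)
}

print(circle(1))
print(rectangle(2, 3))
```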

More R…
  • R GUI
    • R Commander: a basic statistics GUI, consisting of a window containing several menus, buttons, and information fields
    • Sciviews: a suite of companion applications for Windows
  • R and SDMX
  • R Reports
    • package xtable: coerce data to LaTeX and HTML tables
    • package Sweave: a framework for mixing text and R code for automatic report generation

Summary
  • Output Management System
    • SAS/SPSS: it is rarely used for routine work
    • R: output is easily passed from one function to another to do further processing and to obtain more results
  • Macro Language
    • SAS/SPSS: a special language with its own syntax. The new functions are not run in the same way as the built-in procedures
    • R itself is a programming language
  • Matrix Language
    • SAS/SPSS: a special language with its own syntax
    • R is a vector and matrix based language complemented by additional packages: Matrix, SparseM
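
Two of these points can be shown together in a short sketch: the result of a model fit is an ordinary object that is passed on to other functions, and the same estimate can be computed directly in matrix notation. The built-in cars data are used only for illustration.

```r
# Fit a model; the result is an object that other functions can process further.
fit <- lm(dist ~ speed, data = cars)
beta_lm <- coef(fit)               # extract the coefficients
resid_sd <- summary(fit)$sigma     # residual standard error from the summary object

# The same coefficients from the normal equations, written as matrix algebra:
# solve (X'X) b = X'y for b.
X <- cbind(1, cars$speed)
y <- cars$dist
beta_mat <- solve(t(X) %*% X, t(X) %*% y)

print(beta_lm)
print(beta_mat)
```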

Summary (cont.)
  • Publishing results
    • SAS/SPSS: Cut and paste to a Word processor or exporting to a file
    • R: produce LaTeX output (including graphics) using, for example, the Sweave package
  • Data size
    • SAS/SPSS: Limited by the size of the disk
    • R: Limited by the size of the RAM; (non-trivial) use of databases for large data sets is possible
  • Data structure
    • SAS/SPSS: Rectangular data set
    • R: Rectangular data frame, vector, list

Summary (cont.)
  • Interface to other programming languages
    • SAS/SPSS: Not available
    • R: R can be easily mixed with Fortran, C, C++ and Java
  • Source code
    • SAS/SPSS: Not available
    • R: the source code of R itself as well as of its packages is a part of the distribution
