R: An Open Source Statistical Environment

Valentin Todorov

UNIDO

MSIS 2008 (Luxembourg, 7-9 April 2008)

MSIS 2008, Luxembourg: Valentin Todorov


Outline

  • Introduction: the R Platform and Availability

  • R Learning Curve (is R hard to learn)

  • R Extensibility (R Packages)

  • R and the others (Interfaces)

  • R Graphics

  • R for Time series

  • R for Survey Analysis

  • R and the Outliers (Robust Statistics in R)

  • More R features (WEB, Missing data, OOP, GUI)

  • Summary and Conclusions

What is R

  • R is “a system for statistical computation and graphics. It provides, among other things, a programming language, high-level graphics, interfaces to other languages and debugging facilities.”

  • Developed after the S language and environment

    • S was developed at Bell Labs (John Chambers et al.)

    • S-Plus: a value-added implementation of the S language (Insightful Corporation)

    • much code written for S runs unaltered under R

  • Significantly influenced by Scheme, a Lisp dialect

What is R

  • Ihaka and Gentleman, University of Auckland (New Zealand)

    • 1993: a preliminary version of R

    • 1995: released under the GNU General Public License

    • Now: R-core team consisting of 17 members including John Chambers

  • R provides a wide variety of statistical (linear and non-linear modelling, classical statistical tests, time-series analysis, classification, clustering, robust methods and many more) and graphical techniques

  • R is available as Free Software under the terms of the GNU General Public License (GPL).

R Extensibility (R Packages)

  • One of the most important features of R is its extensibility by creating packages of functions and data.

  • The R package system provides a framework for developing, documenting, and testing extension code.

  • Packages can include R code, documentation, data and foreign code written in C or Fortran.

  • Packages are distributed through the CRAN repository (http://cran.r-project.org), which currently holds more than 1300 packages covering a wide variety of statistical methods and algorithms. The ‘base’ and ‘recommended’ packages are included in all binary distributions.
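
A minimal sketch of how packages appear from within an R session. It lists the ‘base’ priority packages of the running installation and attaches one of them; the package name in the commented install line is only an illustration taken from a later slide, and installing it needs network access.

```r
# List the 'base' priority packages shipped with this R installation.
base_pkgs <- rownames(installed.packages(priority = "base"))
print(base_pkgs)

# Attach an installed package so that its functions become available.
library(stats)

# Installing a contributed package from CRAN (not run; needs a network connection):
# install.packages("rrcov")
```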

R and the Others (R Interfaces)

  • Reading and writing data (text files, XML, spreadsheet-like data, e.g. Excel)

  • Read and write data formats of SAS, S-Plus, SPSS, STATA, Systat, Octave – package foreign.

  • Emulation of Matlab – package matlab.

  • Communication with RDBMS – ROracle, RMySQL, RSQLite, RmSQL, RPgSQL, RODBC – for large data sets and concurrent access

  • Package filehash – a simple key-value style database, the data are stored on disk but are handled like data sets

  • Can use compiled native code in C, C++, Fortran, Java
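
As a small illustration of the text-file interface, the sketch below writes a data frame to a temporary CSV file and reads it back using base R only. The data values and column names are made up for the example.

```r
# Hypothetical toy data, written to a temporary CSV file and read back.
df <- data.frame(id = 1:3, value = c(2.5, 3.7, 1.2))
path <- tempfile(fileext = ".csv")

write.csv(df, path, row.names = FALSE)  # export as a text file
df2 <- read.csv(path)                   # import it again
print(df2)

# Files from other statistical systems follow the same pattern, e.g.
# foreign::read.spss() or foreign::read.dta() from package 'foreign'.
```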

R Graphics

  • One of the most important strengths of R – simple exploratory graphics as well as well-designed publication quality plots.

  • The graphics can include mathematical symbols and formulae where needed.

  • Can produce graphics in many formats:

    • On screen

    • PS and PDF for inclusion in LaTeX and pdfLaTeX documents or for distribution

    • PNG or JPEG for the Web

    • On Windows, metafiles for Word, PowerPoint, etc.
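
The sketch below shows one way a plot with a mathematical annotation could be written to a PDF file. The built-in `cars` data set, the axis labels and the output file name are choices made for this example only.

```r
out <- tempfile(fileext = ".pdf")
pdf(out, width = 6, height = 4)          # open a PDF graphics device

plot(cars$speed, cars$dist,
     xlab = "Speed (mph)", ylab = "Stopping distance (ft)",
     main = expression(hat(y) == beta[0] + beta[1] * x))  # formula in the title
abline(lm(dist ~ speed, data = cars))    # add a fitted regression line

invisible(dev.off())                     # close the device and write the file

# postscript() or png() can be opened instead of pdf() for other formats.
```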

R Graphics: basic and multipanel plots (trellis)

R Graphics: parallel plot and coplot
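
The figures for these slides are not part of the transcript. As a rough stand-in, the following sketch draws a conditioning plot with `coplot()` from base graphics on the built-in `quakes` data; the data set is an assumption and is not necessarily the one shown in the talk. Multipanel trellis plots are available through package lattice, e.g. `xyplot()`.

```r
out <- tempfile(fileext = ".pdf")
pdf(out)   # draw to a file so that the sketch also runs without a display

# Conditioning plot: latitude against longitude, one panel per depth interval
coplot(lat ~ long | depth, data = quakes)

invisible(dev.off())
```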

R for Time Series

  • Package stats

    • classical time series modeling tools – arima() for Box-Jenkins type analysis

    • structural time series – StructTS()

    • filtering and decomposition – decompose() and HoltWinters()

  • Package forecast – additional forecast methods and graphical tools

  • Analyzing monthly or lower frequency time series:

    • TRAMO/SEATS

    • X-12-ARIMA

      • accessible through the Gretl library

  • Task View Econometrics: http://cran.r-project.org/web/views/Econometrics.html
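
A brief sketch of the decomposition and filtering tools from package stats, applied to the built-in monthly `co2` series. The choice of series and the 12-month forecast horizon are assumptions made for illustration.

```r
# co2: monthly atmospheric CO2 concentrations, a built-in 'ts' object
dec <- decompose(co2)               # classical seasonal decomposition
print(round(dec$figure, 2))         # estimated seasonal figures

hw   <- HoltWinters(co2)            # Holt-Winters exponential smoothing
pred <- predict(hw, n.ahead = 12)   # forecast one year ahead
print(pred)
```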

R for Time Series: Example

  • Fitting an ARIMA model to a univariate time series with arima() and using tsdiag() to plot time series diagnostics
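
The slide's figure is not reproduced in the transcript. A minimal sketch of the two calls, assuming the built-in `lh` series and an AR(1) specification (neither is confirmed as the example used in the talk), might look like this:

```r
# lh: a built-in univariate time series (48 observations)
fit <- arima(lh, order = c(1, 0, 0))   # AR(1) model with a mean term
print(fit)

out <- tempfile(fileext = ".pdf")
pdf(out)
tsdiag(fit)   # standardized residuals, residual ACF, Ljung-Box p-values
invisible(dev.off())
```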

R for Survey Analysis

  • Complex survey samples are usually analysed by specialized software packages: SUDAAN, Bascula 4 (Statistics Netherlands), etc.

  • STATA provides much more comprehensive support for analysing survey data than SAS and SPSS and could successfully compete with the specialized packages

R for Survey Analysis

  • R – package survey - http://faculty.washington.edu/tlumley/survey/

    • stratification, clustering, possibly multistage sampling, unequal sampling probabilities or weights; multistage stratified random sampling with or without replacement

    • Summary statistics: means, totals, ratios, quantiles, contingency tables, regression models, for the whole sample and for domains

    • Variances by Taylor linearization or by replicate weights (BRR, jack-knife, bootstrap, or user-supplied)

    • Graphics: histograms, hexbin scatterplots, smoothers

  • Other packages: pps, sampling, sampfling
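
To make the role of sampling weights concrete, here is a base R sketch of weighted estimates of a total and a mean for a stratified sample. The data are hypothetical, and the sketch does not use package survey; in that package, functions such as `svydesign()`, `svytotal()` and `svymean()` cover this case and also provide variance estimates.

```r
# Hypothetical stratified sample: two strata, two sampled units each.
smp <- data.frame(
  stratum = c("A", "A", "B", "B"),
  y       = c(10, 12, 20, 24),
  N       = c(100, 100, 50, 50),   # stratum population sizes
  n       = c(2, 2, 2, 2)          # stratum sample sizes
)

smp$w <- smp$N / smp$n             # sampling weight = 1 / inclusion probability

total_hat <- sum(smp$w * smp$y)    # weighted estimate of the population total
mean_hat  <- total_hat / sum(smp$w)  # weighted estimate of the population mean

print(c(total = total_hat, mean = mean_hat))
```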

R and the Outliers (Robust Statistics in R)

  • What are Outliers

    • atypical observations which are inconsistent with the rest of the data or deviate from the postulated model

    • may arise through contamination, errors in data gathering, or misspecification of the model.

    • classical statistical methods are very sensitive to such data

  • What are Robust methods

    • Produce reasonable results even when one or more outliers may appear in the data

    • Robust regression - robustbase

    • Robust multivariate methods – rrcov, robustbase

    • Robust time series analysis - robust-ts

R and the Outliers: Example

  • Example: Wages and Hours - http://lib.stat.cmu.edu/DASL/

    • a national sample of 6000 households with a male head earning less than $15,000 annually in 1966 - 9 independent variables; classified into 39 demographic groups

    • estimate y = the labour supply (average hours) from the available data (for the example we will consider only one variable: x = average age of the respondents)

    • We will fit an Ordinary Least Squares (OLS) model and a robust Least Trimmed Squares (LTS) model
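
The Wages and Hours data are not included here, so the sketch below uses simulated data with three planted outliers to contrast the two fits. It relies on `lqs()` from the recommended package MASS for the LTS fit, which is a substitution made for this example; package robustbase, named on the previous slide, offers `ltsReg()` for the same purpose.

```r
library(MASS)                          # recommended package; provides lqs()

set.seed(1)
x <- 1:20
y <- 1 + 2 * x + rnorm(20, sd = 0.1)   # simulated data: intercept 1, slope 2
y[18:20] <- 0                          # three gross outliers at large x

ols <- lm(y ~ x)                       # ordinary least squares
lts <- lqs(y ~ x, method = "lts")      # least trimmed squares

print(coef(ols))                       # slope pulled towards the outliers
print(coef(lts))                       # slope close to the simulated value
```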

R and the Outliers: Example OLS

R and the Outliers: Example LTS

R and the Outliers: Example Covariance

  • Maronna & Yohai (1998)

  • rrcov: data set maryo

  • A bivariate data set with sample correlation 0.81

  • After interchanging the largest and smallest value in the first coordinate, the sample correlation becomes 0.05
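
The effect can be reproduced in spirit with base R. The sketch below uses made-up data rather than the maryo data set, so the correlations it prints differ from the values quoted above; it only shows how a single interchange in one coordinate lowers the sample correlation.

```r
# Hypothetical data (not the maryo data set): 20 strongly correlated points.
x <- 1:20
y <- x + rep(c(-1, 1), 10)
r_before <- cor(x, y)

# Interchange the largest and smallest value of the first coordinate.
i_min <- which.min(x)
i_max <- which.max(x)
x_swapped <- x
x_swapped[c(i_min, i_max)] <- x[c(i_max, i_min)]
r_after <- cor(x_swapped, y)

print(c(before = r_before, after = r_after))
```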

More R…

  • R and the WEB - several projects that provide possibilities to use R over the WEB

  • R and the Missing – advanced missing value handling

    • mvnmle: ML estimation for multivariate data with missing values

    • mitools: Tools for multiple imputation of missing data

    • mice: Multivariate Imputation by Chained Equations

    • EMV: Estimation of Missing Values for a Data Matrix

    • VIM: provides methods for the visualisation as well as imputation of missing data

  • R Objects – R is an object-oriented language (though in quite a different sense from C++, Java or C#)
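
One way to see the difference is R's S3 system, where methods belong to generic functions rather than to classes. The sketch below defines a hypothetical class `circle` and a hypothetical generic `area()`; both names are invented for the example.

```r
# Constructor for a hypothetical S3 class "circle"
circle <- function(r) structure(list(r = r), class = "circle")

# A new generic function and its method for class "circle"
area <- function(shape, ...) UseMethod("area")
area.circle <- function(shape, ...) pi * shape$r^2

# A method for the existing generic print()
print.circle <- function(x, ...) {
  cat("Circle with radius", x$r, "\n")
  invisible(x)
}

c1 <- circle(2)
print(c1)        # dispatches to print.circle()
a <- area(c1)    # dispatches to area.circle()
```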

More R…

  • R GUI

    • R Commander: a basic statistics GUI, consisting of a window containing several menus, buttons, and information fields

    • SciViews: a suite of companion applications for Windows

  • R and SDMX

  • R Reports

    • package xtable: coerce data to LaTeX and HTML tables

    • package Sweave: a framework for mixing text and R code for automatic report generation

Summary

  • Output Management System

    • SAS/SPSS: it is rarely used for routine work

    • R: output is easily passed from one function to another to do further processing and to obtain more results

  • Macro Language

    • SAS/SPSS: a special language with its own syntax. New functions are not run in the same way as the built-in procedures

    • R itself is a programming language

  • Matrix Language

    • SAS/SPSS: a special language with its own syntax

    • R is a vector and matrix based language complemented by additional packages: Matrix, SparseM
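
Two of these points can be sketched together in a few lines: the result of a model fit is an ordinary object that other functions take as input, and the same estimate can be computed directly with matrix operators. The built-in `cars` data set is used only as an example.

```r
fit <- lm(dist ~ speed, data = cars)   # the fit is an object, not printed output
b   <- coef(fit)                       # pass it on: extract the coefficients
res <- residuals(fit)                  # or the residuals, for further processing

# The same coefficients from the normal equations, using matrix operators
X     <- cbind(1, cars$speed)          # design matrix with an intercept column
b_mat <- solve(t(X) %*% X, t(X) %*% cars$dist)

print(b)
print(drop(b_mat))
```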

Summary (cont.)

  • Publishing results

    • SAS/SPSS: Cut and paste to a Word processor or exporting to a file

    • R: produce LaTeX output (including graphics) using for example the Sweave package

  • Data size

    • SAS/SPSS: Limited by the size of the disk

    • R: Limited by the size of the RAM; (non-trivial) use of databases for large data sets is possible

  • Data structure

    • SAS/SPSS: Rectangular data set

    • R: Rectangular data frame, vector, list
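
The three R data structures named above can be sketched as follows; all object names and values are invented for the example.

```r
v  <- c(1.5, 2.5, 4.0)                               # numeric vector
df <- data.frame(id = 1:3, score = v)                # rectangular data frame
l  <- list(name = "example", values = v, data = df)  # list holding mixed objects

print(mean(df$score))   # access a column of the data frame
print(l$data$id[2])     # access an element nested inside the list
str(l)                  # compact overview of the whole structure
```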

Summary (cont.)

  • Interface to other programming languages

    • SAS/SPSS: Not available

    • R: R can be easily mixed with Fortran, C, C++ and Java

  • Source code

    • SAS/SPSS: Not available

    • R: the source code of R itself as well as of its packages is a part of the distribution
