R: An Open Source Statistical Environment

Valentin Todorov

UNIDO

MSIS 2008 (Luxembourg, 7-9 April 2008)

MSIS 2008, Luxembourg: Valentin Todorov

Outline

- Introduction: the R Platform and Availability
- R Learning Curve (is R hard to learn?)
- R Extensibility (R Packages)
- R and the others (Interfaces)
- R Graphics
- R for Time series
- R for Survey Analysis
- R and the Outliers (Robust Statistics in R)
- More R features (WEB, Missing data, OOP, GUI)
- Summary and Conclusions

What is R

- R is “a system for statistical computation and graphics. It provides, among other things, a programming language, high-level graphics, interfaces to other languages and debugging facilities.”
- Developed after the S language and environment
- S was developed at Bell Labs (John Chambers et al.)
- S-Plus: a value-added implementation of the S language (Insightful Corporation)
- much code written for S runs unaltered under R
- Significantly influenced by Scheme, a Lisp dialect

What is R

- Ihaka and Gentleman, University of Auckland (New Zealand)
- 1993: a preliminary version of R
- 1995: released under the GNU General Public License
- Now: R-core team consisting of 17 members including John Chambers
- R provides a wide variety of statistical (linear and non-linear modelling, classical statistical tests, time-series analysis, classification, clustering, robust methods and many more) and graphical techniques
- R is available as Free Software under the terms of the GNU General Public License (GPL).

R Extensibility (R Packages)

- One of the most important features of R is its extensibility by creating packages of functions and data.
- The R package system provides a framework for developing, documenting, and testing extension code.
- Packages can include R code, documentation, data and foreign code written in C or Fortran.
- Packages are distributed through the CRAN repository – http://cran.r-project.org - currently more than 1300 packages covering a wide variety of statistical methods and algorithms. ‘base’ and ‘recommended’ packages are included in all binary distributions.
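As a small illustration of the package system described above, the following sketch lists the 'base' and 'recommended' packages of the local installation and calls a function from one of them. The choice of `median()` is arbitrary, and the `install.packages()` call is left commented out because it needs network access.

```r
# List the packages shipped with this R installation, by priority.
base_pkgs <- rownames(installed.packages(priority = "base"))
rec_pkgs  <- rownames(installed.packages(priority = "recommended"))
print(base_pkgs)
print(rec_pkgs)

# Attach a package and use one of its functions.
# 'stats' is attached by default; library() is shown only for illustration.
library(stats)
med <- median(c(3, 1, 4, 1, 5))
print(med)

# Installing an add-on package from CRAN (requires network access):
# install.packages("robustbase")
```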

R and the Others (R Interfaces)

- Reading and writing data (text files, XML, spreadsheet-like data, e.g. Excel)
- Read and write data formats of SAS, S-Plus, SPSS, STATA, Systat, Octave – package foreign.
- Emulation of Matlab – package matlab.
- Communication with RDBMS – ROracle, RMySQL, RSQLite, RmSQL, RPgSQL, RODBC – large data sets, concurrency
- Package filehash – a simple key-value style database, the data are stored on disk but are handled like data sets
- Can use compiled native code in C, C++, Fortran, Java
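The simplest of the interfaces listed above, reading and writing text files, can be sketched in base R as a round trip through a temporary CSV file. The data frame and its column names are made up for the illustration. The commented lines name reader functions from package foreign, with hypothetical file names.

```r
# A small, made-up data frame.
df <- data.frame(id = 1:3, value = c(2.5, 3.7, 1.2))

# Write it to a temporary CSV file and read it back.
path <- tempfile(fileext = ".csv")
write.csv(df, path, row.names = FALSE)
df2 <- read.csv(path)
print(df2)

# The round trip preserves the data.
print(all.equal(df, df2))

# Other formats are read with package 'foreign' (file names are hypothetical):
# library(foreign)
# spss_data  <- read.spss("survey.sav", to.data.frame = TRUE)
# stata_data <- read.dta("survey.dta")

unlink(path)
```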

R Graphics

- One of the most important strengths of R – simple exploratory graphics as well as well-designed publication quality plots.
- The graphics can include mathematical symbols and formulae where needed.
- Can produce graphics in many formats:
- On screen
- PS and PDF for including in LaTeX and pdfLaTeX or for distribution
- PNG or JPEG for the Web
- On Windows, metafiles for Word, PowerPoint, etc.
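A minimal sketch of the output formats above: the same plotting code is sent to a PDF and a PostScript device, and the title uses a mathematical annotation. The plotted curve, file locations and figure sizes are arbitrary choices for the illustration.

```r
# PDF output, with a formula in the plot title (plotmath expression).
out_pdf <- tempfile(fileext = ".pdf")
pdf(out_pdf, width = 6, height = 4)
x <- seq(-3, 3, length.out = 200)
plot(x, dnorm(x), type = "l", xlab = "x", ylab = "density",
     main = expression(f(x) == frac(1, sqrt(2 * pi)) * e^{-x^2 / 2}))
dev.off()

# PostScript output of a histogram of simulated data.
out_ps <- tempfile(fileext = ".ps")
set.seed(1)
postscript(out_ps)
hist(rnorm(100), main = "Simulated normal sample")
dev.off()

# Bitmap and metafile devices follow the same open/plot/close pattern:
# png("plot.png"); jpeg("plot.jpg"); win.metafile("plot.wmf")  # last one: Windows only
```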

R Graphics: basic and multipanel plots (trellis)

R Graphics: parallel plot and coplot
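The original figures are not part of this transcript. The following sketch shows how plots of the kinds named in the graphics slides could be produced. It uses the built-in data sets `quakes` and `iris`, which are an assumption of this example and not necessarily the data shown in the talk, and it assumes the recommended package lattice is installed.

```r
library(lattice)  # recommended package, used for the trellis displays

out <- tempfile(fileext = ".pdf")
pdf(out)

# Conditioning plot: latitude against longitude for ranges of depth.
coplot(lat ~ long | depth, data = quakes)

# Multipanel (trellis) scatterplot, one panel per species.
p_xy <- xyplot(Sepal.Length ~ Petal.Length | Species, data = iris)
print(p_xy)

# Parallel coordinates plot of the four measurements, by species.
p_par <- parallelplot(~ iris[1:4] | Species, data = iris)
print(p_par)

dev.off()
```

Lattice functions return plot objects, so they are printed explicitly here. At the interactive prompt this happens automatically.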

R for Time Series

- Package stats
- classical time series modeling tools – arima() for Box-Jenkins type analysis
- structural time series – StructTS()
- filtering and decomposition – decompose() and HoltWinters()
- Package forecast – additional forecast methods and graphical tools
- Analyzing monthly or lower frequency time series:
- TRAMO/SEATS
- X-12-ARIMA
- accessible through the Gretl library
- Task View Econometrics: http://cran.r-project.org/web/views/Econometrics.html
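The filtering and decomposition tools of package stats listed above can be tried on a built-in series. This sketch uses the monthly `co2` data set, which is an arbitrary choice for the illustration.

```r
# Classical decomposition into trend, seasonal and random components.
dec <- decompose(co2)
plot(dec)
print(dec$figure)   # the 12 estimated monthly seasonal effects

# Holt-Winters exponential smoothing and a 12-month-ahead forecast.
hw   <- HoltWinters(co2)
pred <- predict(hw, n.ahead = 12)
print(pred)
plot(hw, pred)
```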

R for Time Series: Example

- Fitting an ARIMA model to a univariate time series with arima() and using tsdiag() to plot time series diagnostics
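The output of the original slide is not preserved in this transcript. A minimal sketch of such an analysis, assuming the built-in `lh` series and an AR(1) specification (both chosen here only for illustration), could look like this:

```r
# Fit an ARIMA(1,0,0) model to the built-in 'lh' series.
fit <- arima(lh, order = c(1, 0, 0))
print(fit)

# Diagnostic plots: standardized residuals, their ACF,
# and p-values of the Ljung-Box statistic.
tsdiag(fit)

# Forecast six steps ahead with standard errors.
fc <- predict(fit, n.ahead = 6)
print(fc$pred)
print(fc$se)
```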

R for Survey Analysis

- Complex survey samples are usually analysed by specialized software packages: SUDAAN, Bascula 4 (Statistics Netherlands), etc.
- STATA provides much more comprehensive support for analysing survey data than SAS and SPSS and could successfully compete with the specialized packages

R for Survey Analysis

- R – package survey - http://faculty.washington.edu/tlumley/survey/
- stratification, clustering, possibly multistage sampling, unequal sampling probabilities or weights; multistage stratified random sampling with or without replacement
- Summary statistics: means, totals, ratios, quantiles, contingency tables, regression models, for the whole sample and for domains
- Variances by Taylor linearization or by replicate weights (BRR, jack-knife, bootstrap, or user-supplied)
- Graphics: histograms, hexbin scatterplots, smoothers
- Other packages: pps, sampling, sampfling
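To make the idea of design-based estimation concrete, here is a base R sketch of the textbook estimator of a mean under stratified simple random sampling without replacement. It is not the API of package survey, which provides functions such as svydesign() and svymean() for this purpose. The strata, population sizes and data are simulated and purely illustrative.

```r
# Simulated stratified sample: three strata with known population sizes N_h.
set.seed(1)
smp <- data.frame(
  stratum = rep(c("A", "B", "C"), times = c(20, 30, 50)),
  y       = c(rnorm(20, 10, 2), rnorm(30, 20, 3), rnorm(50, 30, 4))
)
N_h <- c(A = 200, B = 600, C = 1200)

# Per-stratum sample size, mean and variance (names sorted A, B, C).
by_stratum <- split(smp$y, smp$stratum)
n_h    <- sapply(by_stratum, length)
ybar_h <- sapply(by_stratum, mean)
s2_h   <- sapply(by_stratum, var)

# Stratified mean and its variance with finite population correction.
W_h     <- N_h / sum(N_h)
mean_st <- sum(W_h * ybar_h)
se_st   <- sqrt(sum(W_h^2 * (1 - n_h / N_h) * s2_h / n_h))

# The same point estimate as a weighted mean with weights N_h / n_h.
w      <- (N_h / n_h)[smp$stratum]
mean_w <- sum(w * smp$y) / sum(w)

print(c(mean = mean_st, se = se_st, weighted_mean = mean_w))
```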

R and the Outliers (Robust Statistics in R)

- What are Outliers
- atypical observations which are inconsistent with the rest of the data or deviate from the postulated model
- may arise through contamination, errors in data gathering, or misspecification of the model.
- classical statistical methods are very sensitive to such data
- What are Robust methods
- Produce reasonable results even when one or more outliers may appear in the data
- Robust regression - robustbase
- Robust multivariate methods – rrcov, robustbase
- Robust time series analysis - robust-ts

R and the Outliers: Example

- Example: Wages and Hours - http://lib.stat.cmu.edu/DASL/
- a national sample of 6000 households with a male head earning less than $15,000 annually in 1966 - 9 independent variables; classified into 39 demographic groups
- estimate y = the labour supply (average hours) from the available data (for the example we will consider only one variable: x = average age of the respondents)
- We will fit an Ordinary Least Squares (OLS) model and a robust Least Trimmed Squares (LTS) model
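The Wages and Hours data are not reproduced here, so the following sketch compares OLS and LTS on simulated data with five artificial outliers. It uses lqs() with method = "lts" from the recommended package MASS as a stand-in. The talk refers to package robustbase, whose ltsReg() function serves the same purpose. The true line and the contamination are assumptions of the example.

```r
library(MASS)  # recommended package; provides lqs()

# Simulated data on the line y = 2 + 0.5 x, with five gross outliers.
set.seed(42)
x <- 1:50
y <- 2 + 0.5 * x + rnorm(50)
y[46:50] <- y[46:50] - 30

# Classical and robust fits.
ols <- lm(y ~ x)
lts <- lqs(y ~ x, method = "lts")

print(coef(ols))  # slope pulled down by the outliers
print(coef(lts))  # slope close to the true value 0.5

# Plot both fitted lines.
plot(x, y)
abline(ols, lty = 2)
abline(coef = coef(lts), lty = 1)
legend("topleft", legend = c("OLS", "LTS"), lty = c(2, 1))
```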

R and the Outliers: Example OLS

R and the Outliers: Example LTS

R and the Outliers: Example Covariance

- Maronna & Yohai (1998)
- rrcov: data set maryo
- A bivariate data set with:
- sample correlation: 0.81
- interchange the largest and smallest value in the first coordinate
- the sample correlation becomes 0.05
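The effect can be reproduced in base R without the maryo data. The sketch below uses a small deterministic data set invented for the illustration, so the correlations differ from the 0.81 and 0.05 quoted above, but the mechanism is the same: two exchanged values are enough to reduce the classical sample correlation.

```r
# Illustrative bivariate data with a strong positive relationship.
x <- 1:20
y <- x + 3 * sin(x)
r_before <- cor(x, y)

# Interchange the largest and smallest value in the first coordinate.
i_min <- which.min(x)
i_max <- which.max(x)
x_swapped <- x
x_swapped[c(i_min, i_max)] <- x[c(i_max, i_min)]
r_after <- cor(x_swapped, y)

print(c(before = r_before, after = r_after))
```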

More R…

- R and the WEB - several projects that provide possibilities to use R over the WEB
- R and the Missing – advanced missing value handling
- mvnmle: ML estimation for multivariate data with missing values
- mitools: Tools for multiple imputation of missing data
- mice - Multivariate Imputation by Chained Equations
- EMV: Estimation of Missing Values for a Data Matrix
- VIM: provides methods for the visualisation as well as imputation of missing data
- R Objects – R is an Object Oriented language (however in a quite different sense from C++, Java, C#)
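The remark about object orientation can be illustrated with R's S3 system, in which methods belong to generic functions rather than to classes. The class `account` and the generic `deposit` below are hypothetical names invented for this sketch.

```r
# Constructor: an S3 object is a list with a class attribute.
new_account <- function(owner, balance = 0) {
  structure(list(owner = owner, balance = balance), class = "account")
}

# A method for the existing generic print().
print.account <- function(x, ...) {
  cat("Account of", x$owner, "with balance", x$balance, "\n")
  invisible(x)
}

# A new generic function and its method for class "account".
deposit <- function(acc, amount) UseMethod("deposit")
deposit.account <- function(acc, amount) {
  acc$balance <- acc$balance + amount
  acc  # objects are not modified in place; the updated copy is returned
}

acc <- deposit(new_account("Ann", 100), 50)
print(acc)
```

Unlike in C++ or Java, `deposit(acc, 50)` does not change `acc` itself. The caller has to keep the returned object.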

More R…

- R GUI
- R Commander: a basic statistics GUI, consisting of a window containing several menus, buttons, and information fields
- SciViews: a suite of companion applications for Windows
- R and SDMX
- R Reports
- package xtable: coerce data to LaTeX and HTML tables
- package Sweave: a framework for mixing text and R code for automatic report generation

Summary

- Output Management System
- SAS/SPSS: it is rarely used for routine work
- R: output is easily passed from one function to another to do further processing and to obtain more results
- Macro Language
- SAS/SPSS: a special language with own syntax. The new functions are not run in the same way as the built-in procedures
- R itself is a programming language
- Matrix Language
- SAS/SPSS: A special language with own syntax
- R is a vector and matrix based language complemented by additional packages: Matrix, SparseM
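Two of the points above can be shown in a few lines: results are ordinary objects that can be passed on to other functions, and matrix algebra is part of the language itself. The sketch uses the built-in `cars` data set, chosen only for illustration.

```r
# Output of one function is input to the next.
fit <- lm(dist ~ speed, data = cars)
s   <- summary(fit)       # a list-like object, not printed text
r2  <- s$r.squared        # extract a single number for further use
ci  <- confint(fit)       # confidence intervals from the same fit
print(r2)
print(ci)

# The same regression coefficients via matrix algebra:
# solve the normal equations (X'X) beta = X'y.
X    <- cbind(1, cars$speed)
y    <- cars$dist
beta <- solve(crossprod(X), crossprod(X, y))
print(beta)
print(coef(fit))
```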

Summary (cont.)

- Publishing results
- SAS/SPSS: Cut and paste to a Word processor or exporting to a file
- R: produce LaTeX output (including graphics) using for example the Sweave package
- Data size
- SAS/SPSS: Limited by the size of the disk
- R: Limited by the size of the RAM; using databases for large data sets is possible but not trivial
- Data structure
- SAS/SPSS: Rectangular data set
- R: Rectangular data frame, vector, list

Summary (cont.)

- Interface to other programming languages
- SAS/SPSS: Not available
- R: R can be easily mixed with Fortran, C, C++ and Java
- Source code
- SAS/SPSS: Not available
- R: the source code of R itself as well as of its packages is a part of the distribution

References

- Hornik, K. and Leisch, F. (2005) R Version 2.1.0, Computational Statistics, 20(2), pp. 197-202
- Kabacoff, R. (2008) Quick-R for SAS and SPSS users, available from http://www.statmethods.net/index.html
- López-de-Lacalle, J. (2006) The R-computing language: Potential for Asian economists, Journal of Asian Economics, 17(6), pp. 1066-1081
- Muenchen, R. (2007) R for SAS and SPSS users, URL: http://oit.utk.edu/scc/RforSAS&SPSSusers.pdf
- Murrell, P. (2005) R Graphics, Chapman & Hall
- R Development Core Team (2007) R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, ISBN 3-900051-07-0. URL: http://www.r-project.org/
- Templ, M. and Filzmoser, P. (2008) Visualisation of Missing Values and Robust Imputation in Environmental Surveys, submitted for publication
- Wheeler, D.A. (2007) Why Open Source Software / Free Software (OSS/FS, FLOSS, or FOSS)? Look at the Numbers!
