
R: An Open Source Statistical Environment

Valentin Todorov, UNIDO (v.todorov@unido.org). MSIS 2008 (Luxembourg, 7-9 April 2008).


Presentation Transcript


  1. R: An Open Source Statistical Environment Valentin Todorov UNIDO v.todorov@unido.org MSIS 2008 (Luxembourg, 7-9 April 2008) MSIS 2008, Luxembourg: Valentin Todorov

  2. Outline
     • Introduction: the R Platform and Availability
     • R Learning Curve (is R hard to learn?)
     • R Extensibility (R Packages)
     • R and the Others (Interfaces)
     • R Graphics
     • R for Time Series
     • R for Survey Analysis
     • R and the Outliers (Robust Statistics in R)
     • More R features (Web, Missing data, OOP, GUI)
     • Summary and Conclusions

  3. What is R
     • R is “a system for statistical computation and graphics. It provides, among other things, a programming language, high-level graphics, interfaces to other languages and debugging facilities”
     • Developed after the S language and environment
         • S was developed at Bell Labs (John Chambers et al.)
         • S-Plus: a value-added implementation of the S language (Insightful Corporation)
         • Much code written for S runs unaltered under R
     • Significantly influenced by Scheme, a Lisp dialect

  4. What is R
     • Ihaka and Gentleman, University of Auckland (New Zealand)
         • 1993: a preliminary version of R
         • 1995: released under the GNU General Public License
         • Now: R-core team consisting of 17 members, including John Chambers
     • R provides a wide variety of statistical techniques (linear and non-linear modelling, classical statistical tests, time-series analysis, classification, clustering, robust methods and many more) and graphical techniques
     • R is available as Free Software under the terms of the GNU General Public License (GPL)

  5. R Extensibility (R Packages)
     • One of the most important features of R is its extensibility by creating packages of functions and data.
     • The R package system provides a framework for developing, documenting, and testing extension code.
     • Packages can include R code, documentation, data and foreign code written in C or Fortran.
     • Packages are distributed through the CRAN repository (http://cran.r-project.org), which currently holds more than 1300 packages covering a wide variety of statistical methods and algorithms.
     • ‘base’ and ‘recommended’ packages are included in all binary distributions.

  6. R and the Others (R Interfaces)
     • Reading and writing data (text files, XML, spreadsheet-like data, e.g. Excel)
     • Read and write data formats of SAS, S-Plus, SPSS, Stata, Systat, Octave: package foreign
     • Emulation of Matlab: package matlab
     • Communication with RDBMS for large data sets and concurrency: ROracle, RMySQL, RSQLite, RmSQL, RPgSQL, RODBC
     • Package filehash: a simple key-value style database; the data are stored on disk but are handled like data sets
     • Can use compiled native code in C, C++, Fortran, Java
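As a minimal sketch of the data exchange listed on this slide, the snippet below round-trips a data frame through a CSV text file and through a Stata file using package foreign. The built-in `cars` data set and the temporary file names are assumptions chosen for illustration; they do not come from the presentation.

```r
library(foreign)  # recommended package, shipped with standard R distributions

# Temporary files used only for this illustration
csv_file <- tempfile(fileext = ".csv")
dta_file <- tempfile(fileext = ".dta")

# Text file round trip
write.csv(cars, csv_file, row.names = FALSE)
from_csv <- read.csv(csv_file)

# Stata file round trip via package foreign
write.dta(cars, dta_file)
from_dta <- read.dta(dta_file)

# Inspect what came back
str(from_csv)
str(from_dta)
```

The same package provides readers such as `read.spss()` and `read.xport()` for SPSS and SAS transport files.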

  7. R Graphics
     • One of the most important strengths of R: simple exploratory graphics as well as well-designed, publication-quality plots.
     • The graphics can include mathematical symbols and formulae where needed.
     • Can produce graphics in many formats:
         • On screen
         • PS and PDF for including in LaTeX and pdfLaTeX documents or for distribution
         • PNG or JPEG for the Web
         • On Windows, metafiles for Word, PowerPoint, etc.
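The output formats above correspond to graphics devices in R. The following sketch writes the same plot, with a mathematical annotation in the title, to a PDF and a PostScript file. The `cars` data set, the helper `draw_plot()` and the file names are assumptions made for illustration.

```r
# Output files in a temporary directory (illustrative names)
pdf_file <- file.path(tempdir(), "cars.pdf")
ps_file  <- file.path(tempdir(), "cars.ps")

# Hypothetical helper: scatter plot with a fitted line and a plotmath title
draw_plot <- function() {
  plot(cars$speed, cars$dist,
       xlab = "Speed (mph)", ylab = "Stopping distance (ft)",
       main = expression(hat(y) == hat(beta)[0] + hat(beta)[1] * x))
  abline(lm(dist ~ speed, data = cars))
}

# PDF device, e.g. for pdfLaTeX
pdf(pdf_file, width = 6, height = 4)
draw_plot()
dev.off()

# PostScript device, e.g. for LaTeX
postscript(ps_file, width = 6, height = 4)
draw_plot()
dev.off()
```

Bitmap output for the Web follows the same open, draw, close pattern with `png()` or `jpeg()`, where the R build supports those devices.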

  8. R Graphics: basic and multipanel plots (trellis)

  9. R Graphics: parallel plot and coplot

  10. R for Time Series
     • Package stats
         • classical time series modelling tools: arima() for Box-Jenkins type analysis
         • structural time series: StructTS()
         • filtering and decomposition: decompose() and HoltWinters()
     • Package forecast: additional forecast methods and graphical tools
     • Analyzing monthly or lower frequency time series:
         • TRAMO/SEATS
         • X-12-ARIMA
         • accessible through the Gretl library
     • Task View Econometrics: http://cran.r-project.org/web/views/Econometrics.html

  11. R for Time Series: Example
     • Fitting an ARIMA model to a univariate time series with arima() and using tsdiag() to plot time series diagnostics
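The transcript does not preserve the code or the data behind this slide, so the following is only a sketch of the workflow it describes. It assumes the built-in `lh` series and an AR(1) model, neither of which is confirmed by the presentation.

```r
# Fit an ARIMA(1,0,0) model to a built-in univariate series (assumed data)
fit <- arima(lh, order = c(1, 0, 0))
print(fit)

# Diagnostic plots: standardized residuals, residual ACF,
# and p-values of the Ljung-Box statistic
pdf(file.path(tempdir(), "tsdiag.pdf"))
tsdiag(fit)
dev.off()

# Forecast six steps ahead with standard errors
pred <- predict(fit, n.ahead = 6)
pred$pred
pred$se
```

In an interactive session the `pdf()` and `dev.off()` calls can be dropped so that the diagnostics appear on screen.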

  12. R for Survey Analysis
     • Complex survey samples are usually analysed by specialized software packages: SUDAAN, Bascula 4 (Statistics Netherlands), etc.
     • Stata provides much more comprehensive support for analysing survey data than SAS and SPSS and could successfully compete with the specialized packages

  13. R for Survey Analysis
     • R package survey: http://faculty.washington.edu/tlumley/survey/
     • Designs: stratification, clustering, possibly multistage sampling, unequal sampling probabilities or weights; multistage stratified random sampling with or without replacement
     • Summary statistics: means, totals, ratios, quantiles, contingency tables, regression models, for the whole sample and for domains
     • Variances by Taylor linearization or by replicate weights (BRR, jack-knife, bootstrap, or user-supplied)
     • Graphics: histograms, hexbin scatterplots, smoothers
     • Other packages: pps, sampling, sampfling
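A brief sketch of how these features look in practice, assuming the survey package is installed from CRAN. It uses the `api` example data that ships with that package (a one-stage cluster sample, `apiclus1`); the choice of data and variables is an assumption for illustration and is not taken from the presentation.

```r
library(survey)  # must be installed from CRAN

# Example data shipped with the survey package
data(api)

# One-stage cluster sample: clusters are school districts (dnum),
# with sampling weights (pw) and a finite population correction (fpc)
dclus1 <- svydesign(id = ~dnum, weights = ~pw, data = apiclus1, fpc = ~fpc)

# Mean and total with standard errors by Taylor linearization
api_mean  <- svymean(~api00, dclus1)
enr_total <- svytotal(~enroll, dclus1)

# Domain estimates: mean by school type
by_type <- svyby(~api00, ~stype, dclus1, svymean)

print(api_mean)
print(enr_total)
print(by_type)
```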

  14. R and the Outliers (Robust Statistics in R)
     • What are outliers?
         • atypical observations which are inconsistent with the rest of the data or deviate from the postulated model
         • may arise through contamination, errors in data gathering, or misspecification of the model
         • classical statistical methods are very sensitive to such data
     • What are robust methods?
         • produce reasonable results even when one or more outliers may appear in the data
         • Robust regression: robustbase
         • Robust multivariate methods: rrcov, robustbase
         • Robust time series analysis: robust-ts

  15. R and the Outliers: Example
     • Example: Wages and Hours, http://lib.stat.cmu.edu/DASL/
     • a national sample of 6000 households with a male head earning less than $15,000 annually in 1966; 9 independent variables; classified into 39 demographic groups
     • estimate y = the labour supply (average hours) from the available data (for the example we will consider only one variable, x = average age of the respondents)
     • We will fit an Ordinary Least Squares (OLS) model and a robust Least Trimmed Squares (LTS) model

  16. R and the Outliers: Example OLS

  17. R and the Outliers: Example LTS
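The two example slides survive only as titles, so the comparison is sketched here on simulated data rather than on the Wages and Hours data. Everything in the block is an assumption for illustration: the simulated variables, the planted outliers, and the use of `lqs()` from the recommended package MASS as the LTS estimator (the slides point to robustbase for robust regression).

```r
library(MASS)  # recommended package; lqs() offers an LTS fit

set.seed(1)

# Simulated data: true line y = 10 + 0.5 * x plus noise
n <- 40
x <- runif(n, 20, 60)
y <- 10 + 0.5 * x + rnorm(n, sd = 1)

# Contaminate the five observations with the largest x
idx <- order(x, decreasing = TRUE)[1:5]
y[idx] <- y[idx] - 30
d <- data.frame(x, y)

# Classical and robust fits
ols <- lm(y ~ x, data = d)
lts <- lqs(y ~ x, data = d, method = "lts")

# Compare the estimated coefficients
print(rbind(OLS = coef(ols), LTS = coef(lts)))

# Plot both lines: OLS dashed, LTS solid
pdf(file.path(tempdir(), "ols_vs_lts.pdf"))
plot(d$x, d$y, xlab = "x", ylab = "y")
abline(ols, lty = 2)
abline(lts, lty = 1)
dev.off()
```

The OLS line is pulled towards the contaminated points, while the LTS fit is determined by the majority of the data.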

  18. R and the Outliers: Example Covariance
     • Maronna & Yohai (1998)
     • rrcov: data set maryo
     • A bivariate data set with:
         • sample correlation: 0.81
         • interchange the largest and smallest value in the first coordinate
         • the sample correlation becomes 0.05
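The effect described on this slide can be reproduced in spirit with a few lines of R. The sketch below uses simulated bivariate data, not the maryo data set, so its numbers will differ from the 0.81 and 0.05 quoted above. The robust estimate uses `cov.rob()` from the recommended package MASS, which is an assumed stand-in for the rrcov estimators the slide refers to.

```r
library(MASS)  # recommended package; cov.rob() gives a robust covariance

set.seed(42)

# Simulated, positively correlated bivariate data (assumption)
n  <- 20
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.6)
r_before <- cor(x1, x2)

# Interchange the largest and smallest value in the first coordinate
i_max <- which.max(x1)
i_min <- which.min(x1)
x1_swapped <- x1
x1_swapped[c(i_max, i_min)] <- x1[c(i_min, i_max)]
r_after <- cor(x1_swapped, x2)

# Robust correlation from the minimum covariance determinant estimator
rob   <- cov.rob(cbind(x1_swapped, x2), cor = TRUE, method = "mcd")
r_rob <- rob$cor[1, 2]

print(c(before = r_before, after = r_after, robust = r_rob))
```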

  19. More R…
     • R and the Web: several projects provide ways to use R over the Web
     • R and the Missing: advanced missing value handling
         • mvnmle: ML estimation for multivariate data with missing values
         • mitools: tools for multiple imputation of missing data
         • mice: Multivariate Imputation by Chained Equations
         • EMV: Estimation of Missing Values for a Data Matrix
         • VIM: methods for the visualisation as well as imputation of missing data
     • R Objects: R is an object-oriented language (however, in a quite different sense from C++, Java, C#)

  20. More R…
     • R GUI
         • R Commander: a basic statistics GUI, consisting of a window containing several menus, buttons, and information fields
         • SciViews: a suite of companion applications for Windows
     • R and SDMX
     • R Reports
         • package xtable: coerce data to LaTeX and HTML tables
         • Sweave: a framework for mixing text and R code for automatic report generation

  21. Summary
     • Output Management System
         • SAS/SPSS: it is rarely used for routine work
         • R: output is easily passed from one function to another for further processing and to obtain more results
     • Macro Language
         • SAS/SPSS: a special language with its own syntax; the new functions are not run in the same way as the built-in procedures
         • R: R itself is a programming language
     • Matrix Language
         • SAS/SPSS: a special language with its own syntax
         • R: a vector and matrix based language, complemented by additional packages: Matrix, SparseM
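Two of the points above can be illustrated in a few lines: the result of one function is an ordinary object that other functions accept, and matrix operations are part of the language itself. The sketch uses the built-in `cars` data set, which is an assumption for illustration.

```r
# Output management: a fitted model is an object passed to other functions
fit   <- lm(dist ~ speed, data = cars)
s     <- summary(fit)   # further processing of the fit
coefs <- coef(s)        # matrix: estimates, std. errors, t values, p values
ci    <- confint(fit)   # confidence intervals from the same object
r2    <- s$r.squared    # a single number extracted for later use

# Matrix language: the same OLS estimate from the normal equations
X    <- cbind(1, cars$speed)
beta <- solve(t(X) %*% X, t(X) %*% cars$dist)

print(coefs)
print(ci)
print(beta)
```

Solving the normal equations directly is shown only to demonstrate the matrix syntax; `lm()` itself uses a numerically more stable QR decomposition.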

  22. Summary (cont.)
     • Publishing results
         • SAS/SPSS: cut and paste to a word processor or export to a file
         • R: produce LaTeX output (including graphics) using, for example, Sweave
     • Data size
         • SAS/SPSS: limited by the size of the disk
         • R: limited by the size of the RAM; using databases for large data sets is possible, though not trivial
     • Data structure
         • SAS/SPSS: rectangular data set
         • R: rectangular data frame, vector, list

  23. Summary (cont.)
     • Interface to other programming languages
         • SAS/SPSS: not available
         • R: R can be easily mixed with Fortran, C, C++ and Java
     • Source code
         • SAS/SPSS: not available
         • R: the source code of R itself as well as of its packages is a part of the distribution

  24. References
     • Hornik, K. and Leisch, F. (2005) R Version 2.1.0, Computational Statistics, 20(2), pp. 197-202
     • Kabacoff, R. (2008) Quick-R for SAS and SPSS users, available from http://www.statmethods.net/index.html
     • López-de-Lacalle, J. (2006) The R-computing language: Potential for Asian economists, Journal of Asian Economics, 17(6), pp. 1066-1081
     • Muenchen, R. (2007) R for SAS and SPSS users, URL: http://oit.utk.edu/scc/RforSAS&SPSSusers.pdf
     • Murrell, P. (2005) R Graphics, Chapman & Hall
     • R Development Core Team (2007) R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, ISBN 3-900051-07-0. URL: http://www.r-project.org/
     • Templ, M. and Filzmoser, P. (2008) Visualisation of Missing Values and Robust Imputation in Environmental Surveys, submitted for publication
     • Wheeler, D.A. (2007) Why Open Source Software / Free Software (OSS/FS, FLOSS, or FOSS)? Look at the Numbers!
