
Ricardo: Integrating R and Hadoop



Presentation Transcript


  1. Ricardo: Integrating R and Hadoop • Angel Trifonov, Yun Lu, Ying Wang

  2. Contents • Introduction • Motivating Examples • Preliminaries • Ricardo Design • Experimental Study • Conclusion

  3. Introduction

  4. Data collection • Enterprise datasets • Why are these datasets important? • Statistical analysis over the datasets • Data analyst workflow • Explore/summarize the data • Build a model • Use the model to improve business practices • A statistical package is needed

  5. R and DMS • R design: single server, data held in main memory • Large data  FAIL! • Problem for analysts – they work with large datasets • Workarounds: vertical scalability, or analyzing only subsets • Neither is ideal! • Large-scale data management systems (DMS) • Example: Hadoop • Good at aggregation processing

  6. Ricardo • Overview • Scalable platform for deep analytics • Part of the eXtreme Analytics Platform (XAP) project • Named after economist David Ricardo, known for his work on trade • Facilitates “trading” between R and Hadoop • Previous work combining R with Map-Reduce succeeded on small data • Several advantages

  7. Ricardo advantages • Familiar working environment – analysts keep working within a statistical environment • Data attraction – Hadoop’s flexible data store together with the Jaql query language • Integration of data processing into the analytical workflow – handle large data by preprocessing and reducing it • Reliability and community support – built from open-source projects • Improved user code – encourages cleaner, higher-level analysis code • Deep analytics – can handle many kinds of advanced statistical analyses • No re-inventing of wheels – combines existing statistical and DMS technology

  8. Motivating Examples

  9. Example 1: Simple trading • Analyst workflow: exploration • A graph shows how the perception of movies changes over time • How does the analyst get this data visualization? • R is good for the job, BUT… the raw data is too large for it • Ricardo can help!

  10. Example 2: Simple trading • Analyst workflow: evaluation – a model already exists • The analysis must run over all the data • Ricardo can help once again • What did we see? • Simple trading • First case  aggregate in Hadoop, pass the small result to R • Second case  pass the model from R to Hadoop for evaluation • More complicated analyses? No problem!

  11. Example 3: Complex trading • Analyst workflow: modeling • How? • The simple-trading scheme is no good here • It loses information • Ricardo permits complex trading • The computation is decomposed • Small parts  handled by R • Large parts  handled by Hadoop • Consider an example • Latent-factor model • Every piece of data must be taken into account • Simple trading won’t work

  12. Latent-factor model
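A minimal sketch of what a latent-factor model computes (toy factor values, not the paper's learned ones): each customer and each movie gets a factor vector, and the predicted rating is their dot product.

```python
# Toy latent-factor rating model (made-up values, not learned factors):
# each customer u has a factor vector p[u], each movie m has q[m], and the
# predicted rating is the inner product p[u] . q[m].

def predict(p_u, q_m):
    """Predicted rating = dot product of the customer and movie factor vectors."""
    return sum(a * b for a, b in zip(p_u, q_m))

p = {"Michael": [1.0, 0.5]}        # customer factors (made-up values)
q = {"About Schmidt": [2.0, 4.0]}  # movie factors (made-up values)

print(predict(p["Michael"], q["About Schmidt"]))  # 1.0*2.0 + 0.5*4.0 = 4.0
```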

  13. Preliminaries

  14. The R project • Developed at the University of Auckland, New Zealand • Open-source language and statistical environment • Small maintenance team, but big popularity • Example of functionality:
  fit <- lm(df$mean ~ df$year)
  plot(df$year, df$mean)
  abline(fit)
  • Data frame: R’s equivalent of a database table

  15. Large-scale DMS • Enterprise data warehouses – the dominant type of DMS • Designed for clean/structured data – not a good fit • Analysts often want their data raw and “dirty” • What to do? Use Hadoop! • The Hadoop approach • Hadoop Distributed File System (HDFS) • Operates on raw data files • Processes them according to MapReduce • Map-phase results are fed to the reducers • Used successfully on large-scale datasets • An appealing alternative
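The map/reduce flow described above can be sketched in miniature (a plain-Python toy, not Hadoop's actual API), computing the mean rating per year from raw records:

```python
# Toy MapReduce: the map phase emits (year, rating) pairs from raw records,
# the shuffle groups them by key, and the reduce phase averages each group.
from collections import defaultdict

records = [
    {"customer": "Michael", "movie": {"name": "About Schmidt", "year": 2002}, "rating": 5},
    {"customer": "Anna",    "movie": {"name": "About Schmidt", "year": 2002}, "rating": 3},
    {"customer": "Bob",     "movie": {"name": "Chicago",       "year": 2003}, "rating": 4},
]

def map_phase(record):
    # emit one intermediate key/value pair per record
    yield record["movie"]["year"], record["rating"]

# shuffle: group intermediate pairs by key
groups = defaultdict(list)
for rec in records:
    for year, rating in map_phase(rec):
        groups[year].append(rating)

def reduce_phase(year, ratings):
    # aggregate each group down to a single summary row
    return {"year": year, "mean": sum(ratings) / len(ratings)}

result = sorted((reduce_phase(y, rs) for y, rs in groups.items()),
                key=lambda row: row["year"])
print(result)  # [{'year': 2002, 'mean': 4.0}, {'year': 2003, 'mean': 4.0}]
```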

  16. Jaql: A JSON Query Language • Hadoop drawback – a low-level programming interface • Several languages attempt to ease this • Ricardo uses Jaql • Open-source dataflow language • Jaql scripts are automatically compiled into MapReduce jobs • Operates directly on data files • JSON view:
  [{ customer: "Michael", movie: { name: "About Schmidt", year: 2002 }, rating: 5 }, ...]
  • Jaql query:
  read("ratings")
  -> group by year = $.movie.year into { year, mean: avg($[*].rating) }
  -> sort by [ $.year ]

  17. Ricardo Design

  18. Problem Statement • How to bridge between them? • Hadoop advantage: large-scale processing • Hadoop disadvantage: insufficient analytical functionality • R advantages: statistical software, data analysis • R disadvantages: operates in main memory, limited data sizes

  19. Ricardo Design

  20. Ricardo Design • R driver: • Data is not memory-resident • Does R need memory to store some data? • Hadoop: • Performs the large-scale data operations • Stores data in HDFS • R-Jaql Bridge: • Connects the R driver and the Hadoop cluster • Executes queries (what kind of query?) • Sends results back to R as data frames • Allows Jaql queries to spawn R processes on Hadoop worker nodes

  21. R-Jaql Bridge • Components: an R package (so R can call Jaql) and a Jaql module (so Jaql can call R) • Enables calls in both directions: R  Hadoop and Hadoop  R, which can be nested (R  Hadoop  R, Hadoop  R  Hadoop)
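The trading pattern the bridge supports can be caricatured in a few lines (invented function names, not the real R-package/Jaql-module API): the analysis side submits a query string, the DMS side returns only the small aggregated result, and the bridge parses it into a data-frame-like table.

```python
# Hypothetical bridge sketch (names invented for illustration): the point is
# that only a small JSON result crosses the bridge, never the full dataset.
import json

def dms_execute(query: str) -> str:
    """Stand-in for the Hadoop/Jaql side: pretend the cluster ran the
    aggregation and returned its (small) result as JSON."""
    return json.dumps([{"year": 2002, "mean": 4.0},
                       {"year": 2003, "mean": 3.5}])

def bridge_query(query: str):
    """Stand-in for the R side: submit the script, parse the JSON reply
    into rows that R would expose as a data frame."""
    return json.loads(dms_execute(query))

df = bridge_query('read("ratings") -> group by year = $.movie.year ...')
print([row["year"] for row in df])  # [2002, 2003]
```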

  22. Ricardo Workflow • Analyst’s typical workflow • Data exploration • Preliminary observations • Simple trading • Model building • Deep analytics • Complex trading • Model evaluation • Quality of the models • Simple trading • Why is model building complex trading?

  23. Review Example • Movie recommendation • Data exploration  simple trading (linear regression) • Model building  complex trading (latent-factor model)

  24. Simple Trading – Linear Regression • Get the aggregated data from Hadoop • Fit the data in R
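The two steps on this slide can be sketched in miniature (a Python stand-in with made-up numbers; Ricardo's actual simple trading uses a Jaql aggregation plus R's lm()): the DMS has already reduced the data to one (year, mean) row per year, and the analysis side fits a least-squares line to that small table.

```python
# Least-squares line fit over a tiny aggregated table (toy values):
# mean = a + b * year, with b and a from the closed-form formulas.

years = [2000, 2001, 2002, 2003]   # pretend output of the Hadoop aggregation
means = [3.0, 3.2, 3.4, 3.6]

n = len(years)
xbar = sum(years) / n
ybar = sum(means) / n
b = sum((x - xbar) * (y - ybar) for x, y in zip(years, means)) / \
    sum((x - xbar) ** 2 for x in years)   # slope
a = ybar - b * xbar                        # intercept
print(round(b, 3))  # slope ~0.2 rating points per year
```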

  25. Simple Trading – Evaluate Model • Fit the data • Select the top 10 outliers
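The evaluation step can likewise be sketched (a Python stand-in with made-up data and a made-up fitted line): score every point by its absolute residual against the model and keep the k largest.

```python
# Top-k outlier selection against a fitted line (toy line pred = a + b*x):
# in the real workflow the scoring runs over the full dataset in Hadoop,
# and only the few outlier rows come back to the analyst.

a, b = 0.0, 1.0                         # toy fitted line: pred = a + b*x
data = [(1, 1.0), (2, 2.5), (3, 2.9), (4, 7.0)]

def residual(x, y):
    """Absolute deviation of an observation from the model's prediction."""
    return abs(y - (a + b * x))

top = sorted(data, key=lambda p: residual(*p), reverse=True)[:2]
print(top)  # [(4, 7.0), (2, 2.5)] -- the two largest residuals
```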

  26. Complex Trading • Model-building objectives

  27. Model Building • Randomly pick initial p and q • Set up the optimization method • Compute: • The squared error (e) • The derivative of e with respect to p • The derivative of e with respect to q • Update p and q • Repeat until convergence
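The loop above, sketched for a single rating with one-dimensional factors (a toy Python version; the learning rate, seed, and convergence threshold are arbitrary choices, not from the paper):

```python
# Gradient descent on e = (r - p*q)^2 for one rating and scalar factors:
# compute the error, step p and q downhill along -de/dp and -de/dq, repeat.
import random

random.seed(0)
r = 4.0                                  # observed rating
p, q = random.random(), random.random()  # random initial factors
lr = 0.05                                # learning rate (arbitrary choice)

for _ in range(2000):
    err = r - p * q                      # residual
    if err * err < 1e-12:                # squared error small -> converged
        break
    # de/dp = -2*err*q and de/dq = -2*err*p, so step in the opposite direction
    p += lr * 2 * err * q
    q += lr * 2 * err * p                # uses the already-updated p (in-place variant)

print(round(p * q, 4))  # ~4.0: the factors now reproduce the rating
```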

  28. Model Building • Table r: stores the ratings • Tables p and q: store the latent factors

  29. Details Compute the sum of squared errors Compute the gradient

  30. Other Models • Principal component analysis (PCA) • Compute eigenvectors and eigenvalues • The eigenvectors are mutually perpendicular • Generalized linear models (GLM) • The response variable is expressed as a nonlinear function of the predictors • ……
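The PCA bullet hinges on the eigenvectors of a symmetric (covariance) matrix being mutually perpendicular; a pure-Python 2x2 illustration (toy matrix, not Ricardo's distributed computation):

```python
# Eigenvectors of a symmetric 2x2 matrix [[a, b], [b, c]] via the
# characteristic polynomial; the two resulting axes are perpendicular.
import math

a, b, c = 2.0, 0.8, 1.0                   # toy covariance matrix entries
disc = math.sqrt((a - c) ** 2 + 4 * b * b)
lam1 = (a + c + disc) / 2                 # larger eigenvalue
lam2 = (a + c - disc) / 2                 # smaller eigenvalue

# (b, lam - a) solves (A - lam*I)v = 0 for each eigenvalue lam
v1 = (b, lam1 - a)
v2 = (b, lam2 - a)

dot = v1[0] * v2[0] + v1[1] * v2[1]
print(abs(dot) < 1e-9)  # True: the two principal axes are perpendicular
```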

  31. Implementations • Java Native Interface (JNI) as the bridge between C (R) and Java (Jaql) • How is data transferred across JNI? • A naïve way vs. a better solution • The Jaql wrapper handles data-representation incompatibilities • This lives in the bridge • What are the components of the R-Jaql bridge now?

  32. Experimental Study

  33. Experimental Study

  34. Experimental Study

  35. Experimental Study

  36. Related Work • Scaling out R • Low-level message-passing approaches • Task- and data-parallel computing systems • Automatic parallelization of high-level languages • Deepening a DMS with analytical functionality

  37. Conclusion • Ricardo combines the data management capabilities of Hadoop and Jaql with the statistical functionality provided by R. • Future work • Identifying and integrating additional statistical analyses that are amenable to the Ricardo approach.

  38. References • S. Das, Y. Sismanis, K. S. Beyer, R. Gemulla, P. J. Haas, and J. McPherson. Ricardo: Integrating R and Hadoop. In SIGMOD, pages 987–998, 2010.
