1 / 23

Integrating R Language Runtime System with a Data Stream Warehouse

This talk outlines the motivation, system architecture, and benchmark of integrating the R language runtime system with a data stream warehouse. It also discusses bidirectional calls between R and SQL, as well as the challenges and complexities involved.

ernestob
Download Presentation

Integrating R Language Runtime System with a Data Stream Warehouse

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Integrating the R Language Runtime System with a Data Stream Warehouse

  2. Talk Outline • Motivation • Past stream ATT systems • System architecture • Integrating R runtime with query processor • Bidirectional calls: R calls SQL, SQL calls R • Benchmark of data mapping & transfer

  3. Network Data Streams • Feeds: devices, logs • Timestamps • Intermittent • Arrival our of order • Varying speed • Varying schema • Active processing, but not real-time: <5 mins • Sliding time window

  4. Motivation: SQL • Expressive, standardized. well understood • Efficient, parallel, tunable • Extensible via UDFs

  5. Motivation: Scaling R • Remove RAM limitation • Go beyond 1-threaded processing in 1 node • Parallel processing on multiple nodes • Both worlds • Manage big data in a DBMS • Exploit R math capabilities

  6. Past ATT systems • Gigascope: ultra fast processing stream in NIC (packet level), restricted form of SQL language, no historic tables • DataDepot: store summarized streams, band joins, POSIX file system, compiled SQL queries, integration with feed mgt system, UDFs • TidalRace: Big Data trend, scale out, “V”ariety

  7. TidalRace • HDFS • Large number of nodes • Direct & fast copy of log files, no preprocessing • Multiple asynchronous stream loading • Eventual consistency: MVCC • time-varying schema • Light DBMS for metadata • Integration with stream feed system • Compiled SQL queries

  8. Tidalrace Architecture Data loading and update propagation Maintenance Queries Tidalrace metadata Storage Manager (D3SM) MySQL Data partitions and indices File system (local, D3FS, HDFS)

  9. Temporal Partitioning Index Data New data Time • The primary partitioning field is the record timestamp • Stream data is mostly sorted • Most new data loads into a new partition • Avoid rebuilding indices • Simplified data expiration – roll off oldest partitions

  10. R runtime: challenges • Dynamic storage in RAM, variable generations • Type checking at runtime • RAM constraint to call functions • Data structures: data frames, matrices, vectors • Functional and OO language • Dynamic processing; garbage collector • Runtime based on S language, programmed in C • Block-based processing requires refactoring R libs

  11. Applications

  12. STAR: STream Analytics in R • Separate Unix process • 64 bit memory space • 32 bit int for arrays • Packed binary file • Pipes • Embedded R in C • Compiled query in exec()

  13. Assumptions • Stream velocity handled at ingestion/loading • Acceptable 1-5 minute delay in stream load + analysis • Small size materialized views: time range & aggregation • Large RAM in analytic server • Unlimited size historical tables • Sliding time window: recent data • Separate Unix process: R runtime, compiled query

  14. Mapping Data Types • Atomic • time (POSIX) • int • float (real) • string • Data structures (challenge in SQL, not relational!) • data frame • vector • matrix • list

  15. Data Transfer • Bidirectional pipe: to transform streams into table or to transfer data set • No parsing at all • Packed binary file (varying length strings) • Block-based processing in R (requires reprogramming and wrapping R calls) • Programming: C vs R (speed vs abstraction)

  16. Complexity • Space • Data set: O(dn) • model O(d),O(d2) in RAM • Time • O(dn) to transfer • O(d2n) for many models • Time complexity lower than queries lower than computing a model, same as transforming data set

  17. R calls SQL • Query always has time range: reduce size • Block-based processing to fit in RAM • Packed binary file resembles a packet: header+payload • Always, log data set has timestamps • Every table in SQL can be processed in R, but not every R result can be sent back to DBMS

  18. SQL calls R • Via aggregate UDF, which builds data set • Assumption: most math models take matrices as input. Therefore, given two data set layouts they are converted to matrix form (dense or sparse). • Conversion: table rows with floats are converted to vectors, most tables are converted to matrices, or in general tables with diverse data types are converted to data frames

  19. Examples: use cases • R calls SQL: statistical analyst needs some specific data from the DBMS extracted with comples query. Then computes descriptive statistics and math models • SQL calls R: BI person needs to call some mathematical function in R on a data frame (e.g. smooth a time series) or matrix (get correlation matrix)

  20. Hardware: 4 cores 2Ghz, 4 GB RAM, 1 TB disk (real server much bigger) Software: Linux, R, HDFS, MySQL, GNU C++ Compare read/transfer speed in C and R Compare text (csv) vs binary files Measure throughput (10X faster than query processing) Benchmark: low end equipment

  21. Discussion on performance • Binary files required for high performance: 100X faster than csv files • C 1000X faster than R, but difficult to debug • Disk I/O does not matter for large file because it is sequential access • Data transfer is not a bottleneck (SQL query or R call take >5 seconds on large data set)

  22. Conclusions • Combine SQL queries and R functions seamlessly • Data transfer at maximum speed: reach streaming speed coming from a row DBMS • R can process streams coming from the DBMS, the DBMS can call R in a streaming fashion • Function calls can be fully bidirectional • Any table can be transferred in blocks to R, but only data frames can be transferred from R to DBMS (asymmetric)

  23. Future work • Portable interfacing program with other DBMSs; challenge: source code • Consider alternative storage in DBMS: column, array => data type mapping plus storage conversion • Parallel processing with R on multiple nodes • Evolving models on a stream (time window, visualize) • Debugging dynamic R code in a compiled SQL query

More Related