
Apache Arrow

This presentation gives an overview of the Apache Arrow project. It explains Arrow in terms of its in-memory structure, its purpose, its language interfaces, and its supporting projects.

Links for further information and connecting:
http://www.amazon.com/Michael-Frampton/e/B00NIQDOOM/
https://nz.linkedin.com/pub/mike-frampton/20/630/385
https://open-source-systems.blogspot.com/

semtechs

Presentation Transcript


  1. What Is Apache Arrow? ● A development platform for in-memory data ● It has a columnar memory format ● It provides efficient analytic operations on modern hardware ● Used for in-memory processing ● Cross-language support ● Open source / Apache 2.0 license ● Supports zero-copy reads for lightning fast data access
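The idea of a columnar memory format with zero-copy reads can be illustrated without Arrow itself. The following is a minimal sketch using only Python's standard library; the records and the names `rows`, `ids`, `scores`, and `view` are hypothetical, and this is not Arrow's API, only the same layout idea in miniature.

```python
from array import array

# Hypothetical records, laid out row by row (one tuple per record).
rows = [(1, 9.5), (2, 7.25), (3, 8.0)]

# Columnar layout: one contiguous, typed buffer per column.
ids = array("q", (r[0] for r in rows))      # 64-bit signed integers
scores = array("d", (r[1] for r in rows))   # 64-bit floats

# An analytic operation touches only the column it needs.
total = sum(scores)

# memoryview exposes the column's bytes without copying them.
view = memoryview(scores)
print(total, view.nbytes)
```

Because each column is a single typed buffer, a scan over `scores` never has to step over the `ids` values, which is the property that suits this kind of layout to analytic workloads.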

  2. Languages supported ● Arrow supports many languages ● C ● C++ ● C# ● Go ● Java ● JavaScript ● MATLAB ● Python ● R ● Ruby ● Rust

  3. Open Source Community Support ● Many open source projects support Arrow ● Calcite ● Cassandra ● Drill ● Hadoop ● HBase ● Ibis ● Impala ● Kudu ● Pandas ● Parquet ● Phoenix ● Spark ● Storm

  4. The problem Arrow tackles ● Each system has its own internal memory format ● 70-80% of computation is wasted on serialization and deserialization ● Similar functionality implemented in multiple projects ● Overheads for cross-system communication ● All systems utilize different memory formats

  5. The problem Arrow tackles ● No shared in-memory data model

  6. Arrow solves this problem ● All systems utilize the same memory format – In memory – Columnar format – Optimized for modern CPUs and GPUs ● No overhead for cross-system communication ● Projects can share functionality

  7. Arrow solves this problem ● Arrow shared data model
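The difference between exchanging data through serialization and sharing one memory layout can be sketched as follows. This is an illustration with the standard library only, not Arrow code: JSON stands in for an arbitrary wire format, and `column`, `copied`, and `shared` are hypothetical names for a producer's buffer and two kinds of consumer.

```python
import json
from array import array

column = array("d", [1.5, 2.5, 3.5])

# Without a shared format: the producer serializes, the consumer parses,
# and the data ends up copied into a second, independent object.
wire = json.dumps(column.tolist())
copied = array("d", json.loads(wire))

# With a shared layout: the consumer reads the producer's buffer directly.
shared = memoryview(column)

column[0] = 10.0  # the producer updates its buffer in place

print(copied[0], shared[0])
```

The serialized copy still holds the old value, while the shared view sees the update because it refers to the same bytes. No encode or decode step was needed for the second consumer.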

  8. Arrow works with Parquet ● Arrow is an in-memory format ● Parquet is designed for disk storage ● Arrow and Parquet are intended to be used together ● Parquet is a columnar file format ● Used for data serialization ● Parquet is a streaming format ● Data must be decoded from start to end ● Files are compressed and encoded ● Means smaller files on disk
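The trade-off between a compressed on-disk form and a directly usable in-memory form can be shown in a few lines. In this sketch `zlib` merely stands in for Parquet's own encodings and compression codecs, so it does not produce or read Parquet files; the column contents and names are assumptions chosen for illustration.

```python
import zlib
from array import array

# Hypothetical column with many repeated values.
column = array("q", [7] * 1000 + [42] * 1000)
in_memory = column.tobytes()  # raw layout, directly usable for computation

# "On disk": compressed bytes are smaller but cannot be computed on directly.
on_disk = zlib.compress(in_memory)

# Reading back requires decoding before any computation can happen.
decoded = array("q")
decoded.frombytes(zlib.decompress(on_disk))

print(len(in_memory), len(on_disk))
```

The compressed form saves space at the cost of a decode step, which is why a storage format and an in-memory format complement each other rather than compete.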

  9. Arrow Memory Buffer ● Arrow supports data adjacency for sequential access
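A rough picture of such a buffer for a nullable 32-bit integer column is sketched below, following the validity-bitmap plus values-buffer layout described in the Arrow columnar format. It is a simplified illustration built with `struct`: the padding and alignment that Arrow applies to buffers are omitted, and the helper `get` is hypothetical.

```python
import struct

# Hypothetical nullable int32 column: [1, None, 3, 4]
values = [1, None, 3, 4]

# Values buffer: fixed-width slots stored back to back (null slots hold 0 here).
values_buf = struct.pack("<4i", *(v if v is not None else 0 for v in values))

# Validity bitmap: one bit per slot, least-significant bit first, 1 = valid.
bitmap = 0
for i, v in enumerate(values):
    if v is not None:
        bitmap |= 1 << i
validity_buf = bytes([bitmap])


def get(i):
    """Return slot i, or None when its validity bit is unset."""
    if not (validity_buf[i // 8] >> (i % 8)) & 1:
        return None
    return struct.unpack_from("<i", values_buf, i * 4)[0]


print([get(i) for i in range(len(values))])
```

Since every slot has the same width, slot `i` always sits at byte offset `i * 4`, so a sequential scan walks adjacent memory and random access needs no index structure.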

  10. Available Books ● See “Big Data Made Easy” – Apress Jan 2015 ● See “Mastering Apache Spark” – Packt Oct 2015 ● See “Complete Guide to Open Source Big Data Stack” – Apress Jan 2018 ● Find the author on Amazon – www.amazon.com/Michael-Frampton/e/B00NIQDOOM/ ● Connect on LinkedIn – www.linkedin.com/in/mike-frampton-38563020

  11. Connect ● Feel free to connect on LinkedIn – www.linkedin.com/in/mike-frampton-38563020 ● See my open source blog at open-source-systems.blogspot.com/ ● I am always interested in – New technology – Opportunities – Technology based issues – Big data integration
