
Apache Beam

This presentation gives an overview of the Apache Beam project. It shows that Beam is a means of developing generic data pipelines in multiple languages using the provided SDKs. The pipelines execute on a range of supported runners/executors.

Links for further information and connecting:

http://www.amazon.com/Michael-Frampton/e/B00NIQDOOM/

https://nz.linkedin.com/pub/mike-frampton/20/630/385

https://open-source-systems.blogspot.com/






  1. What Is Apache Beam? ● A unified programming model ● To define and execute data processing pipelines ● For ETL, batch and stream processing ● Open source / Apache 2.0 license ● Written in Java, Python, Go ● Cross-platform support ● Pipelines are defined using the Beam SDKs

  2. How Does Beam Work? ● Use the provided SDKs to define pipelines ● In Java, Python, Go ● The Beam SDK is isolated in a Docker container ● Pipelines can then be executed by any of a supported group of runners ● The capability matrix defines – The relative capabilities of each runner – See beam.apache.org for the matrix
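The capability matrix mentioned above can be thought of as a table of runners against supported model features. The sketch below illustrates that idea in plain Python; the runner/feature entries are simplified examples for illustration, not the authoritative matrix (see beam.apache.org for the real one).

```python
# Toy model of a Beam-style capability matrix: which model features
# each runner claims to support. Entries are ILLUSTRATIVE ONLY, not
# the real matrix from beam.apache.org.

CAPABILITIES = {
    "DirectRunner": {"ParDo", "GroupByKey", "Flatten", "Combine"},
    "FlinkRunner":  {"ParDo", "GroupByKey", "Flatten", "Combine"},
    "SparkRunner":  {"ParDo", "GroupByKey", "Flatten"},
}

def supports(runner, feature):
    """Check whether a runner claims support for a model feature."""
    return feature in CAPABILITIES.get(runner, set())

print(supports("FlinkRunner", "GroupByKey"))  # True
print(supports("SparkRunner", "Combine"))     # False (in this toy matrix)
```

A lookup like this is essentially what you do by eye when reading the matrix: pick the runner column, then check the row for the feature your pipeline needs.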

  3. Beam Programming Guide ● A guide for users creating data pipelines ● Examples in Java, Python, Go ● Covers designing, creating and testing pipelines ● Provides multi-language functions for ● PCollections ● Windowing ● Transforms ● Triggers ● Pipeline I/O ● Metrics ● Schemas ● State and Timers ● Data encoding / type safety
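Of the features listed above, windowing is the least intuitive: it assigns each timestamped element to one or more time windows before grouping. The sketch below shows fixed (tumbling) windowing in plain Python; it is a conceptual illustration, not the real `apache_beam` windowing API.

```python
# Sketch of Beam-style fixed (tumbling) windowing in plain Python.
# Each timestamped element is assigned to the fixed-size window
# that contains its timestamp.

def fixed_window(timestamp, size):
    """Return the [start, end) window an event timestamp falls into."""
    start = timestamp - (timestamp % size)
    return (start, start + size)

# Events as (timestamp_seconds, value) pairs, grouped into 60 s windows
events = [(5, "a"), (42, "b"), (61, "c"), (130, "d")]
windows = {}
for ts, value in events:
    windows.setdefault(fixed_window(ts, 60), []).append(value)

print(windows)
# {(0, 60): ['a', 'b'], (60, 120): ['c'], (120, 180): ['d']}
```

In real Beam code the same grouping is requested declaratively (a window transform applied to a PCollection), and triggers then control when each window's results are emitted.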

  4. Beam Pipelines ● When designing pipelines consider – Where the data is stored – What the data looks like – What you want to do with the data – What your output data should look like – Where the output data should go ● Use PCollections and PTransforms to define pipelines
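The PCollection/PTransform model above boils down to: immutable collections of elements flow through a chain of transforms joined with the pipe (`|`) operator. The following is a minimal in-memory sketch of that idea; the class names mirror Beam's but this is not the real `apache_beam` API.

```python
# Minimal in-memory sketch of Beam's PCollection / PTransform idea.
# NOT the real apache_beam API -- just an illustration of how
# pipelines chain transforms with the "|" (pipe) operator.

class PCollection:
    """An immutable bag of elements flowing through the pipeline."""
    def __init__(self, elements):
        self.elements = list(elements)

    def __or__(self, transform):
        # "pcoll | transform" applies the transform, yielding a new PCollection
        return transform.expand(self)

class Map:
    """A PTransform-like step that applies fn to every element."""
    def __init__(self, fn):
        self.fn = fn
    def expand(self, pcoll):
        return PCollection(self.fn(e) for e in pcoll.elements)

class Filter:
    """A PTransform-like step that keeps elements where fn is True."""
    def __init__(self, fn):
        self.fn = fn
    def expand(self, pcoll):
        return PCollection(e for e in pcoll.elements if self.fn(e))

# Wire up a tiny pipeline: source -> transform -> filter
result = (PCollection(["apache", "beam", "pipeline"])
          | Map(str.upper)
          | Filter(lambda w: len(w) > 4))
print(result.elements)  # ['APACHE', 'PIPELINE']
```

In the real SDKs the same shape appears as a pipeline object piped through I/O and processing transforms, and a runner (Direct, Flink, Spark, Dataflow, etc.) executes the resulting graph rather than Python evaluating it eagerly.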

  5. Beam Example Pipelines

  6. Beam Example Pipelines

  7. Beam Runners ● Supported Beam runners include – Direct Runner (test and development) – Apache Apex – Apache Flink – Apache Gearpump – Apache Hadoop MapReduce – Apache Nemo – Apache Samza – Apache Spark – Google Cloud Dataflow – Hazelcast Jet – IBM Streams – JStorm

  8. Beam Capability Matrix – What Computed

  9. Beam Capability Matrix – Where Computed

  10. Beam Capability Matrix – When Computed

  11. Beam Capability Matrix – How Computed

  12. Available Books ● “Big Data Made Easy” – Apress, Jan 2015 ● “Mastering Apache Spark” – Packt, Oct 2015 ● “Complete Guide to Open Source Big Data Stack” – Apress, Jan 2018 ● Find the author on Amazon – www.amazon.com/Michael-Frampton/e/B00NIQDOOM/ ● Connect on LinkedIn – www.linkedin.com/in/mike-frampton-38563020

  13. Connect ● Feel free to connect on LinkedIn – www.linkedin.com/in/mike-frampton-38563020 ● See my open source blog at open-source-systems.blogspot.com/ ● I am always interested in – New technology – Opportunities – Technology-based issues – Big data integration
