1 / 14

Apache Gobblin

This presentation gives an overview of the Apache Gobblin project. It explains Apache Gobblin in terms of it's architecture, data sources/sinks and it's work unit processing. <br> <br>Links for further information and connecting<br><br>http://www.amazon.com/Michael-Frampton/e/B00NIQDOOM/<br><br>https://nz.linkedin.com/pub/mike-frampton/20/630/385<br><br>https://open-source-systems.blogspot.com/<br>

semtechs
Download Presentation

Apache Gobblin

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. What Is Apache Gobblin ? ● A big data integration framework ● To simplify integration issues like – Data ingestion – Replication – Organization – Lifecycle management ● For streaming and batch ● An Apache incubator project

  2. Gobblin Execution Modes ● Gobblin has a number of execution modes ● Standalone – Run on a single box / JVM / embedded mode ● Map Reduce – Run as a map reduce application ● Yarn / Mesos ( proposed ? ) –Run on a cluster via a scheduler, supports HA ● Cloud – Run on AWS / Azure, supports HA

  3. Gobblin Sinks/Writers ● Gobblin supports the following sinks –Avro HDFS – Parquet HDFS – HDFS byte array – Console (StdOut) – Couchbase – HTTP – JDBC – Kafka

  4. Gobblin Sources Gobblin supports the following sources ●Avro files ●JSON ● File copy ● Kafka ● Query based ● MySQL ● Rest API ● Oracle ● Google Analytics ● Salesforce ● Google drive ● FTP / SFTP ● Google webmaster ● SQL Server ● Hadoop text input ● Teradata ● Hive Avro to ORC ● Wikipedia ● Hive compliance purging

  5. Gobblin Architecture

  6. Gobblin Architecture ● A Gobblin job is built on a set of plugable constructs ● Which are extensible ● A job is a set of tasks created from a workunit ● The workunit serves as a container at runtime ● Tasks are executed by the Gobblin runtime – On the chosen deployment i.e. MapReduce ● Run time handles scheduling, error handling etc ● Utilities handle meta data, state, metrics etc

  7. Gobblin Job

  8. Gobblin Job ● Optional aquire lock (to stop next job instance) ● Create source instance ● From source work units create tasks ● Launch and run tasks ● Publish data if OK to do so ● Persist the job/task states into the state store ● Clean up temporary work data ● Release the job lock ( optional )

  9. Gobblin Constructs

  10. Gobblin Constructs ● Source partitions data into work units ● Source creates work unit data extractors ● Converter converts schema and data records ● Quality checker checks row and task level data ● Fork operator allows control to flow into multiple streams ● Writers sends data records to sink ● Publisher publishes job records

  11. Gobblin Job Configuration ● Goblin jobs are configured via configuration files ● May be named .pull / .job plus .properties ● Source properties file defines – Connection / converter / quality / publisher ● Job file defines – Name / group / description / schedule – Extraction properties – Source properties

  12. Gobblin Users

  13. Available Books ● See “Big Data Made Easy” Apress Jan 2015 – See “Mastering Apache Spark” ● Packt Oct 2015 – See “Complete Guide to Open Source Big Data Stack ● “Apress Jan 2018” – ● Find the author on Amazon www.amazon.com/Michael-Frampton/e/B00NIQDOOM/ – Connect on LinkedIn ● www.linkedin.com/in/mike-frampton-38563020 –

  14. Connect ● Feel free to connect on LinkedIn –www.linkedin.com/in/mike-frampton-38563020 ● See my open source blog at open-source-systems.blogspot.com/ – ● I am always interested in – New technology – Opportunities – Technology based issues – Big data integration

More Related