This presentation gives an overview of the Apache Gobblin project. It explains Apache Gobblin in terms of its architecture, data sources/sinks and its work unit processing.

Links for further information and connecting:
http://www.amazon.com/Michael-Frampton/e/B00NIQDOOM/
https://nz.linkedin.com/pub/mike-frampton/20/630/385
https://open-source-systems.blogspot.com/
What Is Apache Gobblin?
● A big data integration framework
● To simplify integration issues like
  – Data ingestion
  – Replication
  – Organization
  – Lifecycle management
● For streaming and batch
● An Apache incubator project
Gobblin Execution Modes
● Gobblin has a number of execution modes
● Standalone
  – Run on a single box / JVM / embedded mode (see the sketch after this slide)
● MapReduce
  – Run as a MapReduce application
● YARN / Mesos (proposed?)
  – Run on a cluster via a scheduler, supports HA
● Cloud
  – Run on AWS / Azure, supports HA
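As an illustration of standalone / embedded mode, here is a minimal sketch that runs a Gobblin job inside the calling JVM. It assumes the EmbeddedGobblin helper class described in the Gobblin documentation; the package, method and property names (and the example Wikipedia source/converter classes) are assumptions to verify against the Gobblin release in use.

  // Minimal sketch (assumed API): run a Gobblin job inside this JVM.
  import org.apache.gobblin.runtime.embedded.EmbeddedGobblin;

  public class EmbeddedGobblinExample {
    public static void main(String[] args) throws Exception {
      // The same source / converter properties that a .pull file would carry
      // are supplied programmatically here (example class names assumed).
      EmbeddedGobblin job = new EmbeddedGobblin("example-embedded-job")
          .setConfiguration("source.class",
              "org.apache.gobblin.example.wikipedia.WikipediaSource")
          .setConfiguration("source.page.titles", "Apache_Hadoop")
          .setConfiguration("converter.classes",
              "org.apache.gobblin.example.wikipedia.WikipediaConverter");
      job.run();  // blocks until the job completes inside this JVM
    }
  }

The same job definition could instead be handed to the MapReduce launcher unchanged, which is the point of Gobblin's pluggable deployment modes.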
Gobblin Sinks / Writers
● Gobblin supports the following sinks
  – Avro HDFS
  – Parquet HDFS
  – HDFS byte array
  – Console (StdOut)
  – Couchbase
  – HTTP
  – JDBC
  – Kafka
Gobblin Sources
● Gobblin supports the following sources
  – Avro files
  – JSON
  – File copy
  – Kafka
  – Query based
  – MySQL
  – REST API
  – Oracle
  – Google Analytics
  – Salesforce
  – Google Drive
  – FTP / SFTP
  – Google Webmaster
  – SQL Server
  – Hadoop text input
  – Teradata
  – Hive Avro to ORC
  – Wikipedia
  – Hive compliance purging
Gobblin Architecture
● A Gobblin job is built on a set of pluggable constructs
● Which are extensible
● A job is a set of tasks created from work units (a Source skeleton follows this slide)
● The work unit serves as a container at runtime
● Tasks are executed by the Gobblin runtime
  – On the chosen deployment, e.g. MapReduce
● The runtime handles scheduling, error handling etc.
● Utilities handle metadata, state, metrics etc.
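To make the work unit idea concrete, below is a skeleton of a custom Source, the construct that partitions the data and hands work units to the runtime. The interface shape follows the Gobblin documentation, but the package names and the WorkUnit.createEmpty() factory are assumptions, and MyExtractor is a hypothetical Extractor implementation used only as a placeholder.

  import java.io.IOException;
  import java.util.Collections;
  import java.util.List;
  import org.apache.gobblin.configuration.SourceState;
  import org.apache.gobblin.configuration.WorkUnitState;
  import org.apache.gobblin.source.Source;
  import org.apache.gobblin.source.extractor.Extractor;
  import org.apache.gobblin.source.workunit.WorkUnit;

  public class SingleWorkUnitSource implements Source<String, String> {

    @Override
    public List<WorkUnit> getWorkunits(SourceState state) {
      // Partition the data to pull; a single work unit means the runtime
      // will create exactly one task for this job run.
      return Collections.singletonList(WorkUnit.createEmpty());
    }

    @Override
    public Extractor<String, String> getExtractor(WorkUnitState state) throws IOException {
      // One extractor is created per work unit to read its share of the data.
      // MyExtractor is a hypothetical Extractor implementation, not shown here.
      return new MyExtractor(state);
    }

    @Override
    public void shutdown(SourceState state) {
      // Release any connections opened while creating work units.
    }
  }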
Gobblin Job
● Optionally acquire a job lock (to stop the next job instance)
● Create the source instance
● Create tasks from the source's work units
● Launch and run the tasks
● Publish the data if it is OK to do so
● Persist the job/task states into the state store
● Clean up temporary work data
● Release the job lock (optional)
Gobblin Constructs
● The source partitions data into work units
● The source creates work unit data extractors
● Converters convert schemas and data records (a converter sketch follows this slide)
● Quality checkers check row and task level data
● The fork operator allows control to flow into multiple streams
● Writers send data records to the sink
● The publisher publishes job records
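The converter construct is the easiest to sketch: it maps an input schema and records to an output type, emitting zero or more output records per input record. The example below simply upper-cases string records; the package names and method signatures follow the Gobblin documentation but should be treated as assumptions to confirm against the release in use.

  import org.apache.gobblin.configuration.WorkUnitState;
  import org.apache.gobblin.converter.Converter;
  import org.apache.gobblin.converter.DataConversionException;
  import org.apache.gobblin.converter.SchemaConversionException;
  import org.apache.gobblin.converter.SingleRecordIterable;

  public class UpperCaseConverter extends Converter<String, String, String, String> {

    @Override
    public String convertSchema(String inputSchema, WorkUnitState workUnit)
        throws SchemaConversionException {
      // The schema passes through unchanged; a real converter might map JSON to Avro here.
      return inputSchema;
    }

    @Override
    public Iterable<String> convertRecord(String outputSchema, String inputRecord,
        WorkUnitState workUnit) throws DataConversionException {
      // One output record per input record; a converter may also filter
      // (empty iterable) or fan out (many records).
      return new SingleRecordIterable<>(inputRecord.toUpperCase());
    }
  }

Converters are chained through the converter.classes job property, so several small converters can be composed in one job.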
Gobblin Job Configuration
● Gobblin jobs are configured via configuration files
● These may be named .pull / .job plus .properties
● The source properties file defines
  – Connection / converter / quality / publisher properties
● The job file defines (an example follows this slide)
  – Name / group / description / schedule
  – Extraction properties
  – Source properties
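The following is a sketch of what such a .pull job file can look like, modelled on the Wikipedia example that ships with Gobblin. The property keys are taken from the Gobblin documentation but, like the class names, are assumptions to check against the version in use.

  # example.pull - hypothetical job file (Java properties format)
  job.name=PullFromWikipedia
  job.group=Wikipedia
  job.description=Pull page revisions from Wikipedia into Avro files on HDFS
  # job.schedule=0 0/15 * * * ?   (optional Quartz cron schedule)
  # job.lock.enabled=true         (optional lock to stop overlapping runs)

  source.class=org.apache.gobblin.example.wikipedia.WikipediaSource
  source.page.titles=LinkedIn,Apache_Hadoop

  converter.classes=org.apache.gobblin.example.wikipedia.WikipediaConverter
  extract.namespace=org.apache.gobblin.example.wikipedia

  writer.destination.type=HDFS
  writer.output.format=AVRO
  writer.builder.class=org.apache.gobblin.writer.AvroDataWriterBuilder

  data.publisher.type=org.apache.gobblin.publisher.BaseDataPublisher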
Available Books
● "Big Data Made Easy" (Apress, Jan 2015)
● "Mastering Apache Spark" (Packt, Oct 2015)
● "Complete Guide to Open Source Big Data Stack" (Apress, Jan 2018)
● Find the author on Amazon
  – www.amazon.com/Michael-Frampton/e/B00NIQDOOM/
● Connect on LinkedIn
  – www.linkedin.com/in/mike-frampton-38563020
Connect
● Feel free to connect on LinkedIn
  – www.linkedin.com/in/mike-frampton-38563020
● See my open source blog at
  – open-source-systems.blogspot.com/
● I am always interested in
  – New technology
  – Opportunities
  – Technology based issues
  – Big data integration