Apache Samza

What Is Apache Samza?
● An asynchronous computational framework
● For distributed sub-second stream processing
● Fault tolerance, isolation and stateful processing
● Open source / Apache 2.0 license
● Developed in Java and Scala
● Runs stand-alone or on YARN

Samza Use Cases
● Applications that require millisecond-to-second response times
  – Streaming analytics
  – DDoS attack detection
  – Fraud detection
  – Metric anomaly detection
  – System notifications
  – Performance monitoring

Samza Users

Samza Partitioned Stream
● Samza uses streams to process data
● A stream is an ordered collection of immutable messages
● Each message is a key-value pair
● Each stream is sharded into partitions
● This allows the architecture to scale
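The sharding idea on this slide can be sketched in a few lines of plain Python. This is an illustrative model only, not Samza code: the `partition_for` and `shard` helpers, the CRC32 hash, and the sample messages are all assumptions made for the example. In a real deployment the partition for a key is chosen by the underlying messaging system.

```python
import zlib
from collections import defaultdict


def partition_for(key: str, num_partitions: int) -> int:
    """Map a message key to a partition index using a stable hash (assumed policy)."""
    # crc32 is deterministic across runs, unlike Python's built-in hash() for str.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def shard(messages, num_partitions):
    """Append each (key, value) message to its partition, preserving arrival order."""
    partitions = defaultdict(list)
    for key, value in messages:
        partitions[partition_for(key, num_partitions)].append((key, value))
    return partitions


# Hypothetical sample stream of key-value messages.
messages = [
    ("user-1", "login"),
    ("user-2", "login"),
    ("user-1", "click"),
    ("user-1", "logout"),
]
partitions = shard(messages, num_partitions=4)
```

Because the partition depends only on the key, every message for `user-1` lands in the same partition and keeps its relative order, which is what lets independent tasks process partitions in parallel.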

Samza APIs
● High Level Streams API (Java)
  – Stream-based processing API
● Low Level Task API (Java)
  – Message-based processing API
● Table API
  – Data sources with random access by key
● Testing Samza
  – Samza's integration testing framework
● Samza SQL
  – Stream processing via SQL and UDFs
● Apache Beam
  – Samza provides a Beam runner for application execution

Samza Architecture

Samza Architecture
● Applications are broken down into tasks
● Each task consumes data from a stream partition
● Tasks are executed within containers
● A coordinator assigns tasks to containers
● Each task checkpoints the offset of the last message it processed
● Each task has its own state store for state management
● Samza replicates changes to the local store into a separate stream
● This allows later recovery of local stores
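As a rough illustration of the task, container and checkpoint bullets above, the following plain-Python sketch models one task per partition, a round-robin assignment of tasks to containers, and resumption from a checkpoint after a failure. None of this is Samza's API: `assign_tasks`, the `Task` class, the round-robin policy, and the dictionary used as a checkpoint store are assumptions made for the example.

```python
def assign_tasks(num_partitions, num_containers):
    """Spread one task per partition across containers round-robin (assumed policy)."""
    assignment = {container: [] for container in range(num_containers)}
    for task_id in range(num_partitions):
        assignment[task_id % num_containers].append(task_id)
    return assignment


class Task:
    """Consumes one partition and checkpoints the last processed offset."""

    def __init__(self, task_id, partition, checkpoints):
        self.task_id = task_id
        self.partition = partition      # list of messages; list index = offset
        self.checkpoints = checkpoints  # dict standing in for a durable checkpoint store
        self.seen = []

    def run(self, limit):
        """Process up to `limit` messages after the checkpoint, then checkpoint again."""
        start = self.checkpoints.get(self.task_id, -1) + 1
        end = min(start + limit, len(self.partition))
        for offset in range(start, end):
            self.seen.append(self.partition[offset])
        if end > start:
            self.checkpoints[self.task_id] = end - 1  # last processed offset


assignment = assign_tasks(num_partitions=4, num_containers=2)

checkpoints = {}
partition0 = ["a", "b", "c", "d", "e"]

first = Task(0, partition0, checkpoints)
first.run(limit=3)           # processes offsets 0..2, then "fails"

replacement = Task(0, partition0, checkpoints)
replacement.run(limit=10)    # resumes at offset 3 from the checkpoint
```

The replacement task never re-reads offsets 0 to 2 because it starts from the stored checkpoint rather than from the beginning of the partition.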

Samza Architecture ● Task container coordination

Samza Architecture ● Fault tolerance of state
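Fault tolerance of state rests on the idea from the architecture overview: every change to a task's local store is also written to a separate stream, so a lost store can be rebuilt by replaying that stream. The sketch below models this in plain Python. The `ChangelogStore` class, the list standing in for the changelog stream, and the use of `None` as a delete marker are all assumptions for illustration, not Samza's implementation.

```python
class ChangelogStore:
    """Key-value store that appends every write to a changelog."""

    def __init__(self, changelog):
        self.changelog = changelog  # list standing in for a replicated changelog stream
        self.data = {}

    def put(self, key, value):
        self.data[key] = value
        self.changelog.append((key, value))

    def delete(self, key):
        self.data.pop(key, None)
        self.changelog.append((key, None))  # None acts as a delete marker

    @classmethod
    def restore(cls, changelog):
        """Rebuild local state by replaying the changelog from the start."""
        store = cls(changelog)
        for key, value in changelog:
            if value is None:
                store.data.pop(key, None)
            else:
                store.data[key] = value
        return store


changelog = []
store = ChangelogStore(changelog)
store.put("page-views:user-1", 1)
store.put("page-views:user-1", 2)
store.put("page-views:user-2", 1)
store.delete("page-views:user-2")

# Simulate losing the local store and recovering it on another machine.
recovered = ChangelogStore.restore(changelog)
```

Replaying the entries in order reproduces the latest value for each key, so the recovered store matches the one that was lost.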

Samza Architecture ● Incremental checkpointing

Samza Architecture ● State management

Available Books
● See “Big Data Made Easy”
  – Apress Jan 2015
● See “Mastering Apache Spark”
  – Packt Oct 2015
● See “Complete Guide to Open Source Big Data Stack”
  – Apress Jan 2018
● Find the author on Amazon
  – www.amazon.com/Michael-Frampton/e/B00NIQDOOM/
● Connect on LinkedIn
  – www.linkedin.com/in/mike-frampton-38563020

Connect
● Feel free to connect on LinkedIn
  – www.linkedin.com/in/mike-frampton-38563020
● See my open source blog at
  – open-source-systems.blogspot.com/
● I am always interested in
  – New technology
  – Opportunities
  – Technology based issues
  – Big data integration
