Apache Samza: Reliable Stream Processing atop Apache Kafka and Yarn

Presentation Transcript

  1. Apache Samza*: Reliable Stream Processing atop Apache Kafka and Yarn. Sriram Subramanian (LinkedIn; Twitter: @sriramsub1). * Incubating

  2. Agenda
     • Why Stream Processing?
     • What is Samza’s Design?
     • How is Samza’s Design Implemented?
     • How can you use Samza?
     • Example usage at LinkedIn

  3. Why Stream Processing?

  4-7. [Diagram, built up over four slides: a response-latency axis starting at 0 ms. RPC is synchronous; Samza responds in milliseconds to minutes; the far end of the axis is labelled "Later. Possibly much later."]

  8. Newsfeed; Ad Relevance

  9. Search Index; Metrics and Monitoring

  10. What is Samza’s Design?

  11. [Diagram: a job consumes Stream A and Stream B and produces Stream C]

  12. [Diagram: Job 1, Job 2, and Job 3 connected through Streams A-G]

  13-19. Streams [diagram, built up over seven slides]: a stream is split into Partition 0, Partition 1, and Partition 2. Each partition is an ordered sequence of numbered messages (1-7, 1-6, and 1-5 in the figure), and the next message is appended at the end.
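
The partition picture above can be mimicked with a small in-memory model. This is only a sketch, not Kafka's or Samza's API: the `Stream` class, its method names, and the choice to pick a partition by hashing a message key are assumptions made for illustration, and offsets here start at 0 rather than 1 as drawn on the slides.

```python
from dataclasses import dataclass, field
from zlib import crc32


@dataclass
class Stream:
    """Toy model of a partitioned stream: one append-only list per partition."""

    num_partitions: int
    partitions: list = field(init=False)

    def __post_init__(self):
        self.partitions = [[] for _ in range(self.num_partitions)]

    def append(self, key: str, value):
        """Append to the partition chosen by hashing the key (an assumption).

        Returns (partition, offset) of the newly written message.
        """
        p = crc32(key.encode("utf-8")) % self.num_partitions
        self.partitions[p].append(value)
        return p, len(self.partitions[p]) - 1

    def read(self, partition: int, offset: int):
        """Return the messages of one partition, in order, from `offset` on."""
        return self.partitions[partition][offset:]
```

Messages with the same key land in the same partition and keep their relative order; nothing is ever rewritten in place, which is what "next append" on the last slide of the build suggests.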

  20. Jobs [diagram]: Stream A and Stream B feed Task 1, Task 2, and Task 3, which write to Stream C

  21. Jobs [diagram]: AdViews and AdClicks feed Task 1, Task 2, and Task 3, which write to AdClickThroughRate
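
The AdViews / AdClicks job can be sketched as a set of independent tasks. The slides do not say how work is divided, so this sketch assumes one task per input partition and assumes both input streams are partitioned the same way (for example by ad id); `click_through_rates` and `run_job` are hypothetical names, not Samza functions.

```python
from collections import Counter


def click_through_rates(views, clicks):
    """One task: compute clicks / views per ad id for a single partition."""
    view_counts = Counter(views)
    click_counts = Counter(clicks)
    return {ad: click_counts[ad] / n for ad, n in view_counts.items()}


def run_job(ad_views, ad_clicks):
    """Run one task per partition pair and collect each task's output.

    `ad_views` and `ad_clicks` are lists of partitions; each partition is a
    list of ad ids. Partition i of both inputs goes to task i.
    """
    if len(ad_views) != len(ad_clicks):
        raise ValueError("input streams must have the same number of partitions")
    return [click_through_rates(v, c) for v, c in zip(ad_views, ad_clicks)]
```

Because each task only sees its own partitions, the tasks share nothing and could run on different machines.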

  22-29. Tasks [diagram, built up over eight slides]: AdViewsCounterTask reads messages 1-4 from Ad Views, Partition 0, and writes to the Output Count Stream (Partition 0 and Partition 1).

  30-38. Tasks [diagram, continued over nine slides]: a Checkpoint Stream is added alongside the Output Count Stream; the figure shows the value 2 next to the checkpoint.
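
One way to read the checkpoint build is that the task periodically records the offset it has reached, so a restarted task can resume from there. The sketch below models that idea only; the `CounterTask` class, the `checkpoint_every` parameter, and the simulated crash are assumptions for illustration and not Samza's implementation.

```python
class CounterTask:
    """Toy task that reads one partition and checkpoints its offset."""

    def __init__(self, checkpoint_stream, output_stream, checkpoint_every=2):
        self.checkpoint_stream = checkpoint_stream  # append-only list of offsets
        self.output_stream = output_stream          # append-only list of messages
        self.checkpoint_every = checkpoint_every

    def start_offset(self):
        """Resume just after the last checkpointed offset, or from 0."""
        if self.checkpoint_stream:
            return self.checkpoint_stream[-1] + 1
        return 0

    def run(self, partition, crash_after=None):
        """Process messages in order; optionally stop early to mimic a crash."""
        processed = 0
        for offset in range(self.start_offset(), len(partition)):
            if crash_after is not None and processed == crash_after:
                return  # simulated crash: progress since last checkpoint is lost
            self.output_stream.append(partition[offset])
            processed += 1
            if processed % self.checkpoint_every == 0:
                self.checkpoint_stream.append(offset)
```

With a checkpoint every two messages and a crash after the third, a restarted task resumes after the second message and emits the third one again. In this model a message can therefore be processed more than once, but none is skipped.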

  39-40. Dataflow [diagram]: Job 1, Job 2, and Job 3 connected through Streams A-E, with Stream B shown twice

  41. Stateful Processing
     • Windowed Aggregation
       • Counting the number of page views for each user per hour
     • Stream-Stream Join
       • Join a stream of ad clicks to a stream of ad views to identify the view that led to the click
     • Stream-Table Join
       • Join user region info to a stream of page views to create an augmented stream
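
The three kinds of stateful processing listed above can each be sketched in a few lines. These are plain in-memory functions under assumed record shapes (tuples of timestamps, ids, and payloads chosen for illustration), not Samza code; the point is only that each one has to remember something between messages.

```python
from collections import defaultdict


def hourly_page_views(events):
    """Windowed aggregation: count page views per (user, hour).

    `events` yields (timestamp_seconds, user_id); state is the running counts.
    """
    counts = defaultdict(int)
    for ts, user in events:
        counts[(user, ts // 3600)] += 1
    return dict(counts)


def join_clicks_to_views(events):
    """Stream-stream join: match each click to the view that led to it.

    `events` yields ("view" | "click", view_id, payload) in arrival order;
    state is the buffer of views seen so far.
    """
    pending_views = {}
    joined = []
    for kind, view_id, payload in events:
        if kind == "view":
            pending_views[view_id] = payload
        elif view_id in pending_views:
            joined.append((pending_views[view_id], payload))
    return joined


def enrich_page_views(page_views, user_region):
    """Stream-table join: attach each user's region to their page views.

    `page_views` yields (user_id, page); state is the `user_region` table.
    """
    return [(user, page, user_region.get(user)) for user, page in page_views]
```

In all three cases the state (counts, buffered views, or the region table) grows with the data, which is why where that state lives matters.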

  42. How do people do this?
     • In-memory state with checkpointing
       • Periodically save out the task’s in-memory data
       • Becomes very expensive as state grows
       • Some implementations checkpoint only diffs, but this adds complexity

  43. How do people do this?
     • Using an external store
       • Push state to an external store
       • Performance suffers because of remote queries
       • Lack of isolation
       • Limited query capabilities

  44-45. Stateful Tasks [diagram]: Stream A, Task 1, Task 2, Task 3, Stream B

  46-49. Stateful Tasks [diagram, continued over four slides]: the same layout with a Changelog Stream added
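
The transcript shows only the label "Changelog Stream", so the following is a sketch under an assumption: that each task keeps its state in a local store and also appends every write to the changelog, so that the store can be rebuilt by replaying it after a restart. `ChangelogStore` and its methods are hypothetical names, and the changelog is modelled as a plain list.

```python
class ChangelogStore:
    """Toy local key-value store backed by an append-only changelog."""

    def __init__(self, changelog):
        self.changelog = changelog  # list of (key, value); value None = delete
        self.data = {}
        # Restore: replay every change in order to rebuild local state.
        for key, value in changelog:
            self._apply(key, value)

    def _apply(self, key, value):
        if value is None:
            self.data.pop(key, None)
        else:
            self.data[key] = value

    def put(self, key, value):
        """Record the change in the changelog, then apply it locally."""
        self.changelog.append((key, value))
        self._apply(key, value)

    def delete(self, key):
        """Deletes are logged as a (key, None) entry."""
        self.put(key, None)

    def get(self, key, default=None):
        """Reads are served from local state, with no remote query."""
        return self.data.get(key, default)
```

Reads stay local, which avoids the remote-query cost listed for external stores, and only individual changes are written out rather than a full copy of the state.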