1 / 37

Apache Kafka

Apache Kafka. A distributed publish-subscribe messaging system Neha Narkhede, LinkedIn nnarkhede@ linkedin .com, 11/11/11. Outline. Introduction to pub-sub Kafka at LinkedIn Hadoop and Kafka Design Performance. What is pub sub ?. Consumer. Producer. publish(topic , msg ). subscribe.

hateya
Download Presentation

Apache Kafka

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Apache Kafka A distributed publish-subscribe messaging system Neha Narkhede, LinkedInnnarkhede@linkedin.com, 11/11/11

  2. Outline • Introduction to pub-sub • Kafka at LinkedIn • Hadoop and Kafka • Design • Performance

  3. What is pub sub ? Consumer Producer publish(topic, msg) subscribe Topic 1 Topic 2 Topic 3 msg Publish subscribe system Consumer Producer msg

  4. Outline • Introduction to pub-sub • Kafka at LinkedIn • Hadoop and Kafka • Design • Performance

  5. Motivation • Activity tracking • Operational metrics

  6. Kafka • Distributed • Persistent • High throughput

  7. Kafka at LinkedIn Frontend Frontend Service Service Broker Broker Broker Kafka Hadoop DWH Real time monitoring Security systems News feed

  8. Outline • Introduction to pub-sub • Kafka at LinkedIn • Hadoop and Kafka • Design • Performance

  9. Hadoop Data Load for Kafka Live data center Offline data center Frontend Kafka Frontend Kafka Kafka Real time consumers Kafka Kafka Hadoop Hadoop Kafka Hadoop Hadoop Dev Hadoop PROD Hadoop

  10. Multi DC data deployments Live data centers Offline data centers Real time consumers Kafka Hadoop Hadoop Real time consumers Hadoop Hadoop Real time consumers Hadoop Hadoop Hadoop DWH Real time consumers Kafka Real time consumers Real time consumers

  11. Volume • 20B events/day • 3 terabytes/day • 150K events/sec

  12. Message queues • ActiveMQ • TIBCO • Log aggregators • Flume • Scribe KAFKA • Focus on HDFS • Push model • No rewindable consumption • Low throughput • Secondary indexes • Tuned for low latency

  13. What Kafka offers • Very high performance • Elastically scalable • Low operational overhead • Durable, highly available (coming soon)

  14. Architecture Producer Producer Broker Broker ZK Broker Broker Consumer Consumer

  15. Outline • Introduction to pub-sub • Kafka at LinkedIn • Hadoop and Kafka • Design • Performance

  16. Efficiency #1: simple storage • Each topic has an ever-growing log • A log == a list of files • A message is addressed by a log offset

  17. Efficiency #2: careful transfer • Batch send and receive • No message caching in JVM • Rely on file system buffering • Zero-copy transfer: file -> socket

  18. Multi subscribers • 1 file system operation per request • Consumption is cheap • SLA based message retention • Rewindable consumption

  19. Guarantees • Data integrity checks • At least once delivery • In order delivery, per partition

  20. Automatic load balancing Producer Producer Broker Broker Consumer Consumer

  21. Auditing • # events published = # events consumed

  22. Outline • Introduction to pub-sub • Kafka at LinkedIn • Hadoop and Kafka • Design • Performance

  23. Performance • 2 Linux boxes • 16 2.0 GHz cores • 6 7200 rpm SATA drive RAID 10 • 24GB memory • 1Gb network link • 200 byte messages • Producer batch size 200 messages

  24. Basic performance metrics • Producer batch size = 40K • Consumer batch size = 1MB • 100 topics, broker flush interval = 100K • Producer throughput = 90 MB/sec • Consumer throughput = 60 MB/sec • Consumer latency = 220 ms

  25. Latency vs throughput (100 topics, 1 producer, 1 broker)

  26. Scalability (10 topics, broker flush interval 100K)

  27. Throughput vs Unconsumed data

  28. State of the system • 4 clusters per colo, 4 servers each • 850 socket connections per server • 20 TB • 430 topics • Batched frontend to offline datacenter latency => 6-10 secs • Frontend to Hadoop latency => 5 min

  29. State of the system • Successfully deployed in production at LinkedIn and other startups • Apache incubator inclusion • 0.7 Release • Compression • Cluster mirroring

  30. Replication

  31. Some project ideas • Security • Long poll • More compression codecs • Locality of consumption

  32. Team • Jay Kreps • Jun Rao • Neha Narkhede • Joel Koshy • Chris Burroughs

  33. http://incubator.apache.org/kafka/index.html kafka-users@incubator.apache.org http://www.linkedin.com/in/nehanarkhede @nehanarkhede #kafka Thank you

  34. API

More Related