
Data Loss and Data Duplication in Kafka

Data Loss and Data Duplication in Kafka. Jayesh Thakrar. Kafka is a distributed, partitioned, replicated, durable commit log service. It provides the functionality of a messaging system, but with a unique design. Exactly once - each message is delivered once and only once.


Presentation Transcript


  1. Data Loss and Data Duplication in Kafka Jayesh Thakrar

  2. Kafka is a distributed, partitioned, replicated, durable commit log service. It provides the functionality of a messaging system, but with a unique design. Exactly once - each message is delivered once and only once

  3. AGENDA • Kafka Overview • Data Loss • Data Duplication • Data Loss and Duplicate Prevention • Monitoring

  4. Kafka Overview

  5. Kafka As A Log Abstraction [Diagram: Client: Producer; Kafka Server = Kafka Broker holding Topic: app_events; Client: Consumer A; Client: Consumer B] Source: https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying

  6. Topic Partitioning [Diagram: Client: Producer or Consumer; Kafka Broker; Topic: app_events] Source: https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying

  7. Topic Partitioning – Scalability [Diagram: leader and replica partitions spread across Kafka Broker 0, Kafka Broker 1, and Kafka Broker 2; Clients: Producer, Consumer]

  8. Topic Partitioning – Redundancy [Diagram: leader and replica partitions spread across Kafka Broker 0, Kafka Broker 1, and Kafka Broker 2; Client: Producer, Consumer]

  9. Topic Partitioning – Redundancy/Durability [Diagram: leader and replica partitions spread across Kafka Broker 0, Kafka Broker 1, and Kafka Broker 2; pull-based inter-broker replication]

  10. Topic Partitioning – summary • Log sharded into partitions • Messages assigned to partitions by API or custom partitioner • Partitions assigned to brokers (manual or automatic) • Partitions replicated (as needed) • Messages ordered within each partition • Message offset = absolute position in partition • Partitions stored on filesystem as ordered sequence of log segments (files)
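The partitioning summary above can be illustrated with a toy in-memory log. This is only a sketch: `PartitionedLog`, `choose_partition`, and the partition count are hypothetical names chosen for the illustration, and CRC32 stands in for whatever hash a real partitioner uses.

```python
import zlib
from collections import defaultdict
from typing import Tuple

NUM_PARTITIONS = 3  # assumed partition count for this illustration


def choose_partition(key: bytes, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a message key to a partition (CRC32 is a stand-in hash)."""
    return zlib.crc32(key) % num_partitions


class PartitionedLog:
    """Toy topic: one append-only list per partition."""

    def __init__(self) -> None:
        self.partitions = defaultdict(list)

    def append(self, key: bytes, value: str) -> Tuple[int, int]:
        """Append a message and return (partition, offset)."""
        partition = choose_partition(key)
        log = self.partitions[partition]
        log.append(value)
        # Offset = absolute position of the message within its partition.
        return partition, len(log) - 1


topic = PartitionedLog()
first = topic.append(b"user-42", "login")
second = topic.append(b"user-42", "click")
```

Messages with the same key land in the same partition, so their relative order is preserved there; no ordering is implied across partitions.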

  11. Other Key Concepts • Cluster = collection of brokers • Broker-id = a unique id (integer) assigned to each broker • Controller = functionality within each broker responsible for leader assignment and management, with one being the active controller • Replica = partition copy, represented (identified) by the broker-id • Assigned replicas = set of all replicas (broker-ids) for a partition • ISR = In-Sync Replicas = subset of assigned replicas (brokers) that are “in-sync/caught-up”* with the leader (ISR always includes the leader)

  12. Data Loss

  13. Data Loss: Inevitable Up to 0.01% data loss. For 700 billion messages / day, that's up to 70 million / day

  14. Data Loss at the Producer • Kafka Producer API • API call-tree • kafkaProducer.send() • …. accumulator.append() // buffer • …. sender.send() // network I/O • Messages accumulate in buffer in batches • Batched by partition, retry at batch level • Expired batches dropped after retries • Error count and other metrics via JMX • Data Loss at Producer • Failure to close / flush producer on termination • Dropped batches due to communication or other errors when acks = 0 or retry exhaustion • Data produced faster than delivery, causing BufferExhaustedException (deprecated in 0.10+)
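The producer-side loss modes above can be reproduced with a toy model. The sketch below is not the Kafka client API: `ToyProducer`, `flaky_broker`, and the retry count are hypothetical, and the model only mimics the buffer-then-send behaviour the slide describes.

```python
from collections import defaultdict
from typing import Callable, Dict, List


class ToyProducer:
    """Toy model of a buffering producer (not the Kafka client API)."""

    def __init__(self, deliver: Callable[[int, List[str]], bool], retries: int = 2) -> None:
        self.deliver = deliver  # returns True when the batch is acknowledged
        self.retries = retries
        self.buffer: Dict[int, List[str]] = defaultdict(list)  # batches keyed by partition
        self.dropped: List[str] = []  # messages lost after retry exhaustion

    def send(self, partition: int, msg: str) -> None:
        """Only buffers the message; nothing is transmitted yet."""
        self.buffer[partition].append(msg)

    def flush(self) -> None:
        """Transmit every buffered batch, retrying at batch level."""
        for partition, batch in list(self.buffer.items()):
            attempts = 1 + self.retries
            # any() stops at the first successful attempt.
            if not any(self.deliver(partition, batch) for _ in range(attempts)):
                self.dropped.extend(batch)  # the whole batch is dropped
            del self.buffer[partition]


delivered: List[str] = []


def flaky_broker(partition: int, batch: List[str]) -> bool:
    """Hypothetical broker: partition 1 is unreachable in this scenario."""
    if partition == 1:
        return False
    delivered.extend(batch)
    return True


producer = ToyProducer(flaky_broker)
producer.send(0, "a")
producer.send(1, "b")
producer.flush()
producer.send(0, "c")  # still buffered: lost if the process exits without flush/close
```

Here "b" is lost through retry exhaustion, and "c" would be lost by terminating without a flush.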

  15. Data Loss at the Cluster (by Brokers) [Flowchart with steps numbered 1 to 7. Start: Broker crashes, detected by Controller via ZooKeeper. Decisions: Was it a leader? Was it in ISR? ISR >= min.insync.replicas? Other replicas in ISR? Allow unclean election? Other replicas available? Outcomes: Relax, everything will be fine; Elect another leader; Partition unavailable!!]

  16. Non-leader broker crash [Same flowchart as slide 15]

  17. Leader broker crash: Scenario 1 [Same flowchart as slide 15]

  18. Leader broker crash: Scenario 2 [Same flowchart as slide 15]

  19. Data Loss at the Cluster (by Brokers) [Same flowchart as slide 15, with the added note:] Potential data loss depending upon acks config at producer. See KAFKA-3919, KAFKA-4215
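The flowchart's edges did not survive in this transcript, so the following is one plausible reading of its decisions written as a function, not a reconstruction of the slide. The function name, parameter names, and returned strings are all hypothetical.

```python
def on_broker_crash(
    was_leader: bool,
    was_in_isr: bool,
    isr_remaining: int,        # in-sync replicas still alive after the crash
    replicas_remaining: int,   # live out-of-sync replicas
    min_insync_replicas: int,
    unclean_election: bool,
) -> str:
    """Return the outcome for one partition hosted on the crashed broker."""
    if not was_leader:
        # A follower crash only matters if it shrinks the ISR too far.
        if not was_in_isr or isr_remaining >= min_insync_replicas:
            return "ok"
        return "unavailable"
    if isr_remaining > 0:
        return "elect in-sync leader"
    if unclean_election and replicas_remaining > 0:
        # An out-of-sync replica may be missing acknowledged messages.
        return "elect out-of-sync leader (potential data loss)"
    return "unavailable"
```

Under this reading, the data-loss path is the one where the leader crashes, no in-sync replica is left, and unclean election promotes a replica that had fallen behind.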

  20. FROM KAFKA-3919

  21. FROM KAFKA-4215

  22. Config for Data Durability and Consistency • Producer config - acks = -1 (or all) - max.block.ms (blocking on buffer full, default = 60000) and retries - request.timeout.ms (default = 30000), which triggers retries • Topic config - min.insync.replicas = 2 (or higher) • Broker config - unclean.leader.election.enable = false - timeout.ms (default = 30000), inter-broker timeout for acks
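As a rough illustration, the durability settings above can be checked mechanically. `durability_warnings` is a hypothetical helper, not part of Kafka; the config keys are the ones named on the slide, and a missing `unclean.leader.election.enable` is flagged because the sketch assumes it should be set explicitly.

```python
from typing import Dict, List


def durability_warnings(producer: Dict, topic: Dict, broker: Dict) -> List[str]:
    """Return a warning for each setting that deviates from the durable profile."""
    warnings = []
    if str(producer.get("acks")) not in ("-1", "all"):
        warnings.append("producer: acks should be -1 (all)")
    if int(topic.get("min.insync.replicas", 1)) < 2:
        warnings.append("topic: min.insync.replicas should be 2 or higher")
    if str(broker.get("unclean.leader.election.enable")).lower() != "false":
        warnings.append("broker: unclean.leader.election.enable should be false")
    return warnings


durable = durability_warnings(
    {"acks": "all"},
    {"min.insync.replicas": 2},
    {"unclean.leader.election.enable": "false"},
)
fast = durability_warnings(
    {"acks": 1},
    {"min.insync.replicas": 1},
    {"unclean.leader.election.enable": "true"},
)
```

The first call uses the durability profile and yields no warnings; the second uses throughput-oriented values and is flagged on all three settings.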

  23. Config for Availability and Throughput • Producer config - acks = 0 (or 1) - buffer.memory, batch.size, linger.ms (default = 0 as of 0.10) - request.timeout.ms, max.block.ms (default = 60000), retries - max.in.flight.requests.per.connection • Topic config - min.insync.replicas = 1 (default) • Broker config - unclean.leader.election.enable = true

  24. Data Duplication

  25. Data Duplication: How it occurs • Producer (API) retries: messages are resent after a timeout when retries > 0 • Consumer consumes messages more than once after restart from unclean shutdown / crash [Diagram: Client: Producer; Kafka Broker with Topic: app_events; Client: Consumer A; Client: Consumer B]
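The producer-side case can be shown with a toy broker whose first acknowledgement is lost in transit. This is a simplified model with hypothetical names (`ToyBroker`, `send_with_retries`), not the Kafka client: the write succeeds, the producer sees a timeout, and the retry appends the same message again.

```python
from typing import List


class ToyBroker:
    """Toy broker that stores every message but loses the first ack."""

    def __init__(self) -> None:
        self.log: List[str] = []
        self.lose_next_ack = True

    def append(self, msg: str) -> bool:
        self.log.append(msg)  # the write itself succeeds
        if self.lose_next_ack:
            self.lose_next_ack = False
            return False  # ack lost: the producer observes a timeout
        return True


def send_with_retries(broker: ToyBroker, msg: str, retries: int) -> bool:
    """Send once, then resend up to `retries` times until an ack arrives."""
    for _ in range(1 + retries):
        if broker.append(msg):
            return True
    return False


broker = ToyBroker()
acked = send_with_retries(broker, "m1", retries=1)
```

After the call the producer believes it sent "m1" once, yet the log holds it twice.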

  26. Data Loss & Duplication Detection

  27. How to Detect Data Loss & Duplication - 1 1) Msg from producer to Kafka 2) Ack from Kafka with details 3) Producer inserts into store 4) Consumer reads msg 5) Consumer validates msg: if it exists in the store, it is not a duplicate (consume msg, delete it from the store); if it is missing, it is a duplicate msg. Audit: remaining msgs in store are "lost" or "unconsumed" msgs. [Diagram: Producer, Kafka, Consumer, and Memcache / HBase / Cassandra / other store] Store layout: KEY = Topic, Partition, Offset | VALUE = Msg Key or Hash
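A minimal sketch of this audit scheme, using a plain dictionary in place of Memcache, HBase, or Cassandra. `AuditStore` and its method names are hypothetical.

```python
from typing import Dict, List, Tuple

Key = Tuple[str, int, int]  # (topic, partition, offset)


class AuditStore:
    """In-memory stand-in for the external key-value store."""

    def __init__(self) -> None:
        self.entries: Dict[Key, str] = {}

    def record_ack(self, key: Key, msg_hash: str) -> None:
        """Step 3: the producer stores the acked (topic, partition, offset)."""
        self.entries[key] = msg_hash

    def validate(self, key: Key) -> bool:
        """Step 5: True on first consumption, False if already consumed."""
        if key in self.entries:
            del self.entries[key]
            return True
        return False

    def unconsumed(self) -> List[Key]:
        """Audit: whatever remains was lost or not yet consumed."""
        return sorted(self.entries)


store = AuditStore()
for offset in range(3):
    store.record_ack(("app_events", 0, offset), "hash-%d" % offset)

# The consumer sees offsets 0, 1, and 1 again; offset 2 never arrives.
results = [store.validate(("app_events", 0, o)) for o in (0, 1, 1)]
```

The second read of offset 1 is reported as a duplicate, and offset 2 stays in the store for the audit to pick up.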

  28. How to Detect Data Loss & Duplication - 2 1) Msg from producer to Kafka 2) Ack from Kafka with details 3) Producer maintains window stats 4) Consumer reads msg 5) Consumer validates window stats at end of interval [Diagram: Producer, Kafka, Consumer, and Memcache / HBase / Cassandra / other store] Store layout: KEY = Source, time-window | VALUE = Msg count or some other checksum (e.g. totals, etc.)
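The window-based variant compares per-window counts instead of tracking each message. The sketch below assumes a 60-second window and events given as `(source, timestamp)` pairs; the function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

WINDOW_SECONDS = 60  # assumed window size


def window_counts(events: Iterable[Tuple[str, int]]) -> Counter:
    """Count events per (source, window start)."""
    return Counter((source, ts - ts % WINDOW_SECONDS) for source, ts in events)


def compare(produced: Counter, consumed: Counter) -> Dict[Tuple[str, int], int]:
    """Return consumed minus produced for every window where they differ."""
    report = {}
    for key in produced.keys() | consumed.keys():
        diff = consumed[key] - produced[key]
        if diff:
            report[key] = diff  # negative = loss, positive = duplication
    return report


produced = window_counts([("app", 0), ("app", 10), ("app", 61), ("app", 70)])
consumed = window_counts([("app", 0), ("app", 61), ("app", 70), ("app", 70)])
report = compare(produced, consumed)
```

Counts are cheaper to keep than per-message keys, but a loss and a duplicate in the same window cancel out, which is why the slide also mentions other checksums.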

  29. Data Duplication: How to minimize at consumer • If possible, look up last processed offset in destination at startup [Diagram: Client: Producer; Kafka Broker with Topic: app_events; Client: Consumer A; Client: Consumer B]
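One way to picture this: the destination stores each record together with its source offset, and the consumer asks the destination where to resume. The sketch uses hypothetical names (`Destination`, `run_consumer`) and a list in place of a partition.

```python
from typing import List, Optional, Tuple


class Destination:
    """Toy sink that keeps each record with the offset it came from."""

    def __init__(self) -> None:
        self.rows: List[Tuple[int, str]] = []

    def last_offset(self) -> int:
        """Offset of the last record written, or -1 if empty."""
        return self.rows[-1][0] if self.rows else -1

    def write(self, offset: int, value: str) -> None:
        self.rows.append((offset, value))


def run_consumer(log: List[str], dest: Destination, crash_after: Optional[int] = None) -> None:
    """Resume after the destination's last offset; optionally stop early."""
    start = dest.last_offset() + 1
    for n, offset in enumerate(range(start, len(log))):
        if crash_after is not None and n >= crash_after:
            return  # simulated crash before any offset commit
        dest.write(offset, log[offset])


log = ["a", "b", "c", "d"]
dest = Destination()
run_consumer(log, dest, crash_after=2)  # writes offsets 0 and 1, then "crashes"
run_consumer(log, dest)                 # restart resumes at offset 2
```

Because the restart position comes from the destination itself rather than from a separately committed offset, the two records written before the crash are not processed again.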

  30. Monitoring

  31. Monitoring and Operations: JMX Metrics Producer JMX Consumer JMX

  32. Questions?

  33. Jayesh Thakrar jthakrar@conversantmedia.com
