
Apache Hadoop Ingestion Patterns & Apache Flume



Presentation Transcript


  1. Apache Hadoop Ingestion Patterns & Apache Flume • Ted Malaska

  2. Agenda • Selecting an Ingestion Strategy • Apache Flume • High Level Components • Flume’s Guarantees • Common Architectures • Detailed Configurations • Performance Tuning • Example

  3. Selecting an Ingestion Strategy • Timeliness • Append or Delta • Access Patterns • Original Source System • Network Concerns • Transformation, Partitioning, and Bifurcation

  4. Timeliness • Macro Batch: 15 minutes to hours • Micro Batch: 4 minutes to 15 minutes • Mini Micro Batch: Under 4 minutes but greater than 30 seconds • Near Real Time Decision Support: Under 30 seconds but over 2 seconds • Near Real Time Event Processing: Down to about 100 to 200 milliseconds • Real Time: 
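
The tiers above can be read as a simple lookup from acceptable latency to a tier name. The following is a minimal sketch of that reading; the function name `timeliness_tier` is hypothetical, the handling of exact boundary values is an assumption, and the slide's "Real Time" entry is cut off, so it is not modeled.

```python
def timeliness_tier(latency_seconds: float) -> str:
    """Map an acceptable end-to-end latency to the tier names on the slide."""
    if latency_seconds < 0:
        raise ValueError("latency must be non-negative")
    if latency_seconds >= 15 * 60:
        return "Macro Batch"
    if latency_seconds >= 4 * 60:
        return "Micro Batch"
    if latency_seconds > 30:
        return "Mini Micro Batch"
    if latency_seconds > 2:
        return "Near Real Time Decision Support"
    # The slide's "Real Time" threshold is truncated, so everything at or
    # below 2 seconds is reported as event processing (an assumption).
    return "Near Real Time Event Processing"
```

For example, a one-hour tolerance maps to "Macro Batch" and a ten-second tolerance to "Near Real Time Decision Support".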

  5. Append or Delta • Existing Data is Immutable • Existing Data is Mutable for a Fixed Window • Existing Data is Always Mutable

  6. Access Patterns • Batch • MR • Hive • Pig • Crunch • Graph • Time of Thought or NRT • Impala • Search • Get, Put, Scan

  7. Original Source System • File System • RDBMS • Stream • Log Files

  8. Network Concerns • Security • Bandwidth and Compression

  9. Transformation, Partitioning, and Bifurcation • Transformation: Converting XML or JSON to delimited data. • Partitioning: Incoming data is stock trade data, and partitioning by ticker is required. • Bifurcation: The data needs to land in both HDFS and HBase to serve different access patterns.
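
The three terms on this slide can be illustrated with a toy pipeline. This is a sketch only: the field names (`ticker`, `price`, `volume`), the `|` delimiter, and the in-memory lists standing in for HDFS and HBase are all assumptions made for illustration, not part of the presentation.

```python
import json
from collections import defaultdict

def transform(raw_json, fields=("ticker", "price", "volume"), delimiter="|"):
    """Transformation: turn one JSON record into a delimited line."""
    record = json.loads(raw_json)
    return delimiter.join(str(record[name]) for name in fields)

def partition_by_ticker(raw_events):
    """Partitioning: group the transformed lines by ticker symbol."""
    partitions = defaultdict(list)
    for raw in raw_events:
        ticker = json.loads(raw)["ticker"]
        partitions[ticker].append(transform(raw))
    return dict(partitions)

def bifurcate(line, destinations):
    """Bifurcation: deliver the same line to every destination."""
    for destination in destinations:
        destination.append(line)

# Hypothetical trade events; plain lists stand in for HDFS and HBase.
trades = [
    '{"ticker": "ABC", "price": 10.5, "volume": 100}',
    '{"ticker": "XYZ", "price": 20.0, "volume": 50}',
]
hdfs_lines, hbase_rows = [], []
for trade in trades:
    bifurcate(transform(trade), [hdfs_lines, hbase_rows])
```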

  10. Apache Flume • History • Scribe • Flume • Flume NG

  11. High Level Components • Diagram: Point A (Avro Client, JMS) → Sources → Interceptors → Selectors → Channels → Sinks → Point B (HDFS, HBase)

  12. Sources • AvroSource • HTTPSource • NetcatSource • SpoolDirectorySource • ExecSource • JMSSource • ThriftSource • SyslogTcpSource • SyslogUDPSource

  13. Interceptors • RegexExtractorInterceptor • TimestampInterceptor • StaticInterceptor • HostInterceptor • Custom

  14. Selectors • MultiplexingChannelSelector • ReplicatingChannelSelector • Custom
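
The difference between the two built-in selectors is easiest to see side by side. The sketch below models an event as a plain dict with `headers` and `body`; it imitates the routing idea only and is not Flume's Java API. The header name `type` and the channel names are hypothetical.

```python
def replicating_selector(event, channels):
    """Replicating: every configured channel receives every event."""
    return list(channels)

def multiplexing_selector(event, header, mapping, default):
    """Multiplexing: one header value decides which channels get the event."""
    value = event["headers"].get(header)
    return mapping.get(value, default)

# Hypothetical event and routing table.
alert = {"headers": {"type": "alert"}, "body": b"disk full"}
routes = {"alert": ["hbase_channel"], "log": ["hdfs_channel"]}
```

With these inputs, `multiplexing_selector(alert, "type", routes, ["hdfs_channel"])` picks only `hbase_channel`, while the replicating selector would return every channel it was given.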

  15. Channel • FileChannel • MemoryChannel

  16. Sinks • HDFSEventSink • HBaseSink • AsyncHBaseSink • NullSink • RollingFileSink • AvroSink • ThriftSink • MorphlineSink • ElasticSearchSink

  17. Flume’s Guarantees • There is no such thing as a 100% guarantee • Flume offers several levels of configurable guarantees • This is done through transactions

  18. Flume’s Guarantees (Transactions 1 of 3) • Diagram: the Avro Client submits a batch to the Flume Agent; the Flume Agent confirms the batch with guarantees

  19. Flume’s Guarantees (Transactions 2 of 3) • Diagram: Point A (Avro Client, JMS) → Sources → Interceptors → Selectors → Channels → Sinks → Point B (HDFS, HBase)

  20. Flume’s Guarantees (Transactions 3 of 3) • Memory Channel: Best Effort • File Channel: JBOD • File Channel: RAID • File Channel: NAS or SAN
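
The transaction slides describe batches that are either confirmed or not. A toy model of that all-or-nothing behavior is sketched below; `ToyChannel` is a hypothetical in-memory stand-in, not Flume's channel implementation, and it ignores durability, which is what separates the memory and file channel options listed above.

```python
class ToyChannel:
    """In-memory stand-in for a channel with batch-level commit/rollback."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.committed = []

    def put_batch(self, events):
        """Accept the whole batch or none of it; True means 'confirmed'."""
        if len(self.committed) + len(events) > self.capacity:
            return False  # rolled back: the sender keeps the batch and retries
        self.committed.extend(events)
        return True

    def take_batch(self, max_events, deliver):
        """Remove events only after `deliver` succeeds; return count removed."""
        batch = self.committed[:max_events]
        try:
            deliver(batch)
        except Exception:
            return 0  # rolled back: events stay in the channel for a retry
        del self.committed[:len(batch)]
        return len(batch)
```

Because a failed delivery leaves the batch in the channel, a retry can deliver events that partly arrived the first time, so downstream systems should tolerate duplicates.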

  21. Common Architectures (Fan In) HDFS

  22. Common Architectures (Bifurcation) HDFS HDFS DR

  23. Common Architectures (Alerting or Partitioning) HDFS Partition 1 Partition 2 HBase

  24. Detailed Configurations: Avro Source & Client • Bind and port • Threads • Batch Size • Compression • SSL Encryption • IP Filtering

  25. Detailed Configurations: JMS Source • Connection Factory • Provider URL • Destination Name • Destination Type (queue or topic) • Message Selector • User Name • Password File • Batch Size

  26. Detailed Configurations: FileChannel • User home • Data Directories • Capacity • Keep alive • Transaction Capacity • Checkpointing • Directory • Use Dual Checkpoints • Backup checkpoint directory • Checkpoint Interval • Max file size • Minimum required space • useFastReplay • encryptionActiveKey & encryptionCipherProvider

  27. Detailed Configurations: MemoryChannel • Capacity • transactionCapacity • byteCapacity • byteCapacityBufferPercentage • Keep-Alive

  28. Example of Configuration: HDFSEventSink (1 of 3) • hdfs.path • hdfs.filePrefix • hdfs.inUsePrefix • hdfs.inUseSuffix • hdfs.rollInterval • hdfs.rollCount • hdfs.rollSize • hdfs.codeC • hdfs.fileType • hdfs.idleTimeout • hdfs.batchSize • hdfs.threadsPoolSize
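
To show how the sink properties on this slide fit into an agent definition, the sketch below renders a small Flume properties file from a dict. The helper `render_flume_properties`, the agent name `a1`, the component names `r1`, `c1`, `k1`, and every value are illustrative assumptions; only the property keys are taken from Flume's configuration format.

```python
def render_flume_properties(agent, settings):
    """Render 'agent.key = value' lines in the order the settings are given."""
    return "\n".join(f"{agent}.{key} = {value}" for key, value in settings.items())

# Illustrative values only: Avro source -> file channel -> HDFS sink.
example_settings = {
    "sources": "r1",
    "channels": "c1",
    "sinks": "k1",
    "sources.r1.type": "avro",
    "sources.r1.bind": "0.0.0.0",
    "sources.r1.port": 4141,
    "sources.r1.channels": "c1",
    "channels.c1.type": "file",
    "sinks.k1.type": "hdfs",
    "sinks.k1.channel": "c1",
    "sinks.k1.hdfs.path": "/flume/events",
    "sinks.k1.hdfs.filePrefix": "events",
    "sinks.k1.hdfs.rollInterval": 300,    # seconds between file rolls
    "sinks.k1.hdfs.rollSize": 134217728,  # bytes written before rolling
    "sinks.k1.hdfs.rollCount": 0,         # 0 disables rolling by event count
    "sinks.k1.hdfs.fileType": "DataStream",
    "sinks.k1.hdfs.batchSize": 1000,
}

if __name__ == "__main__":
    print(render_flume_properties("a1", example_settings))
```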

  29. Example of Configuration: HDFSEventSink (2 of 3) • Path Escaping • Using Headers to partition data

      Alias     Description
      %{host}   Substitute value of event header named “host”. Arbitrary header names are supported.
      %t        Unix time in milliseconds
      %a        locale’s short weekday name (Mon, Tue, ...)
      %A        locale’s full weekday name (Monday, Tuesday, ...)
      %b        locale’s short month name (Jan, Feb, ...)
      %B        locale’s long month name (January, February, ...)
      %c        locale’s date and time (Thu Mar 3 23:05:25 2005)
      %d        day of month (01)
      %D        date; same as %m/%d/%y
      %H        hour (00..23)
      %I        hour (01..12)
      %j        day of year (001..366)
      %k        hour ( 0..23)
      %m        month (01..12)
      %M        minute (00..59)
      %p        locale’s equivalent of am or pm
      %s        seconds since 1970-01-01 00:00:00 UTC
      %S        second (00..59)
      %y        last two digits of year (00..99)
      %Y        year (2010)
      %z        +hhmm numeric timezone (for example, -0400)
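
The escape sequences above can be thought of as a template expanded per event. The sketch below imitates that expansion for `%{header}` and a handful of the date codes; it is not Flume's implementation. It assumes the event carries a `timestamp` header in Unix milliseconds and formats it in UTC, and the function name `expand_path` is hypothetical.

```python
import re
from datetime import datetime, timezone

# Matches either %{headerName} or one of a few single-letter date codes.
_TOKEN = re.compile(r"%\{([^}]+)\}|%([YmdHMS])")

def expand_path(template, headers):
    """Expand %{header} and %Y %m %d %H %M %S using the event's headers."""
    millis = int(headers["timestamp"])  # assumed: Unix time in milliseconds
    when = datetime.fromtimestamp(millis / 1000, tz=timezone.utc)

    def replace(match):
        header_name, date_code = match.groups()
        if header_name is not None:
            return headers[header_name]
        return when.strftime("%" + date_code)

    return _TOKEN.sub(replace, template)
```

For instance, a template such as `/flume/%{host}/%Y/%m/%d/%H` places each event in a directory derived from its `host` header and its event hour, which is how headers end up partitioning the data.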

  30. Example of Configuration: HDFSEventSink (3 of 3) • File Formats and Compression • Text Files • Sequence Files • Avro Files • Can’t Use Columnar File Types • RC • Parquet

  31. Example of Configuration: HBaseSink • Table name • Column Family • Batch size • HBase user • kerberosPrincipal & kerberosKeytab • enableWal • Serializer

  32. Example of Configuration: AsyncHBaseSink • Table name • Column Family • Batch size • HBase user • enableWal • Serializer

  33. Thank you!
