
Fast Data Made Easy

Learn how to handle fast data processing and storage using Kafka and Kudu, with examples and insights from Cloudera experts Ted Malaska and Jeff Holoman.


Presentation Transcript


  1. Fast Data Made Easy • With Kafka and Kudu • Ted Malaska, Cloudera (@tedmalaska) • Jeff Holoman, Cloudera (@jeffholoman)

  2. A Little History

  3. Bank Ledger • All txns must be queryable within 5 min • XML must be parsed and reformatted • In-process counting • 100% Correct! [Diagram: a txn stream (100 insert, 101 insert, 100 update, 100 update, 102 insert) delivered as XML, flowing from the RDBMS into Hadoop for SQL access]

  4. Distributed Systems • Things fail • Systems are designed to tolerate failure • We must expect failures, and design our code and configure our systems to handle them

  5. Options • All txns must be queryable within 5 min • XML must be parsed and reformatted • In-process counting • 100% Correct — Option 1: [Diagram: RDBMS → Sqoop → Hadoop, queried with SQL]

  6. Bank Ledger • All txns must be queryable within 5 min • XML must be parsed and reformatted • In-process counting • 100% Correct — Option 2: [Diagram: txn stream → RDBMS → Hadoop, queried with SQL] Drawbacks: • Compaction • De-duplication • In-process is hard

  7. “There are only two hard problems in distributed systems: 2. Exactly-once delivery, 1. Guaranteed order of messages, 2. Exactly-once delivery” — Mathias Verraes (@mathiasverraes)

  8. Bank Ledger • All txns must be queryable within 5 min • XML must be parsed and reformatted • In-process counting • Correct! — Option 3a: [Diagram: txn stream → RDBMS → HBase → Hadoop, queried via SQL + App] Drawbacks: • Compaction • HBase → HDFS copy • Complex • HBase scans slow • Joins

  9. Bank Ledger • All txns must be queryable within 5 min • XML must be parsed and reformatted • In-process counting • Correct! — Option 3b: [Diagram: txn stream → RDBMS → HBase check → Hadoop, queried via SQL + App] Drawbacks: • Compaction • Complex

  10. Bank Ledger • All txns must be queryable within SECONDS (was: 5 min) • XML must be parsed and reformatted • In-process counting • Correct! — The new option: [Diagram: txn stream from the RDBMS through Kafka into Kudu, queried via SQL or an application] • Free exactly-once • Immediately available • Guaranteed ordering • Updates
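The "free exactly-once" bullet follows from writing with upserts against a primary key: replaying the same record after a failure simply overwrites an identical row, so duplicates cost nothing. A minimal sketch in plain Scala (our own illustration, with an in-memory map standing in for a Kudu table keyed on (transaction_id, uuid), as in the ledger DDL later in the deck):

```scala
// A ledger record; fields mirror a subset of the ledger table's columns.
case class Txn(transactionId: String, uuid: String, action: String, amountCents: Long)

// Upsert keyed on (transaction_id, uuid): applying the same record twice
// leaves the table unchanged, which is what makes replays harmless.
def upsert(ledger: Map[(String, String), Txn], t: Txn): Map[(String, String), Txn] =
  ledger.updated((t.transactionId, t.uuid), t)

// A consumer restart redelivers txn "100"; the keyed upsert absorbs it.
val delivered = Seq(
  Txn("100", "a1", "insert", 5000),
  Txn("101", "b1", "insert", 2500),
  Txn("100", "a1", "insert", 5000) // redelivery after a failure
)
val ledger = delivered.foldLeft(Map.empty[(String, String), Txn])(upsert)
```

The same at-least-once stream thus yields an exactly-once result, with no de-duplication pass.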

  11. Apache Kudu (Incubating) • Columnar datastore • Fast inserts/updates • Efficient scans • Complements HDFS and HBase • Real-time row-based access [Diagram: the same rows (A1 B1 C1 / A2 B2 C2 / A3 B3 C3) laid out as row-based storage vs. columnar storage (A1 A2 A3 / B1 B2 B3 / C1 C2 C3)]
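The slide's A/B/C diagram can be restated in a few lines of Scala (our own illustration, not the speakers' code): the same three records, laid out record-contiguous versus column-contiguous.

```scala
// Three rows of three columns, as in the slide's diagram.
val rows = Seq(
  Seq("A1", "B1", "C1"),
  Seq("A2", "B2", "C2"),
  Seq("A3", "B3", "C3")
)

// Row-based storage keeps each record contiguous on disk...
val rowLayout = rows.flatten

// ...while columnar storage keeps each column contiguous, so a scan that
// only needs column B reads one contiguous run instead of skipping
// through whole records.
val columnLayout = rows.transpose.flatten
```

This contiguity is why columnar layouts give efficient scans over a few columns, while row layouts favor reading or updating whole records.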

  12. Kudu Ledger Table

  create table `ledger` (
    uuid STRING,
    transaction_id STRING,
    customer_id INT,
    source STRING,
    db_action STRING,
    time_utc STRING,
    `date` STRING,
    amount_dollars INT,
    amount_cents INT,
    local_timestamp BIGINT
  )
  DISTRIBUTE BY HASH(transaction_id) INTO 20 BUCKETS
  TBLPROPERTIES(
    'storage_handler' = 'com.cloudera.kudu.hive.KuduStorageHandler',
    'kudu.table_name' = 'ledger',
    'kudu.master_addresses' = 'jhol-1.vpc.cloudera.com:7051',
    'kudu.key_columns' = 'transaction_id,uuid',
    'kudu.num_tablet_replicas' = '3');

  13. API

  14. Kudu Aggregation Demo

  CREATE EXTERNAL TABLE `gamer` (
    `gamer_id` STRING,
    `last_time_played` BIGINT,
    `games_played` INT,
    `games_won` INT,
    `oks` INT,
    `deaths` INT,
    `damage_given` INT,
    `damage_taken` INT,
    `max_oks_in_one_game` INT,
    `max_deaths_in_one_game` INT
  )
  TBLPROPERTIES(
    'storage_handler' = 'com.cloudera.kudu.hive.KuduStorageHandler',
    'kudu.table_name' = 'gamer',
    'kudu.master_addresses' = 'ip-172-31-43-177.us-west-2.compute.internal:7051',
    'kudu.key_columns' = 'gamer_id');

  15. Kudu Aggregation Architecture [Diagram: Generator → Kafka → Spark Streaming → Kudu, queried by Impala, SparkSQL, and Spark MLlib]

  16. Kudu Aggregation Demo DEMO

  17. Kudu Aggregation MLlib

  val resultDf = sqlContext.sql(
    "SELECT gamer_id, oks, games_won, games_played FROM gamer")
  val parsedData = resultDf.map(r => {
    val array = Array(r.getInt(1).toDouble, r.getInt(2).toDouble, r.getInt(3).toDouble)
    Vectors.dense(array)
  })
  val dataCount = parsedData.count()
  if (dataCount > 0) {
    val clusters = KMeans.train(parsedData, 3, 5)
    clusters.clusterCenters.foreach(v => println(" Vector Center:" + v))
  }

  18. Kudu CDC Demo

  CREATE EXTERNAL TABLE `gamer_cdc` (
    `gamer_id` STRING,
    `eff_to` STRING,
    `eff_from` STRING,
    `last_time_played` BIGINT,
    `games_played` INT,
    `games_won` INT,
    `oks` INT,
    `deaths` INT,
    `damage_given` INT,
    `damage_taken` INT,
    `max_oks_in_one_game` INT,
    `max_deaths_in_one_game` INT
  )
  TBLPROPERTIES(
    'storage_handler' = 'com.cloudera.kudu.hive.KuduStorageHandler',
    'kudu.table_name' = 'gamer_cdc',
    'kudu.master_addresses' = 'ip-172-31-43-177.us-west-2.compute.internal:7051',
    'kudu.key_columns' = 'gamer_id, eff_to');

  19. Kudu CDC Architecture [Flow: Get gamer_id + empty eff_to → if a record is found: Put old gamer_id + new eff_to, then Update old gamer_id + empty eff_to; if not found: Put new gamer_id + empty eff_to]
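The flow above can be sketched in plain Scala (our own illustration: `GamerRow` and `applyCdc` are hypothetical names, and an in-memory map stands in for the gamer_cdc table, which is keyed on (gamer_id, eff_to) with an empty eff_to marking the current version):

```scala
// A trimmed-down gamer_cdc row; eff_to == "" means "current version".
case class GamerRow(gamerId: String, effTo: String, effFrom: String, oks: Int)

def applyCdc(table: Map[(String, String), GamerRow],
             update: GamerRow,
             now: String): Map[(String, String), GamerRow] =
  table.get((update.gamerId, "")) match {
    case Some(current) =>
      // Record found: copy the old version under a closed eff_to, then
      // overwrite the current row (empty eff_to) with the new values.
      table
        .updated((current.gamerId, now), current.copy(effTo = now))
        .updated((update.gamerId, ""), update.copy(effFrom = now))
    case None =>
      // No record yet: put the new gamer_id with an empty eff_to.
      table.updated((update.gamerId, ""), update)
  }

var table = Map.empty[(String, String), GamerRow]
table = applyCdc(table, GamerRow("g1", "", "t0", oks = 5), now = "t0") // first sighting
table = applyCdc(table, GamerRow("g1", "", "t0", oks = 9), now = "t2") // change arrives
```

After the second event the table holds two versions of g1: the history row closed at t2 (oks = 5) and the current row with an empty eff_to (oks = 9).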

  20. Kudu Bitemporality [Flow: Starting point → Insert new eff_to → Update old record to new]

  21. Kudu CDC Demo
