1 / 45

Stinger Initiative: Deep Dive

Stinger Initiative: Deep Dive. Interactive Query on Hadoop. Chris Harris E-Mail : charris@hortonworks.com Twitter : cj_harris5. Agenda. Key Hive Use Cases Brief Refresher on Hive The Stinger Initiative: Interactive Query for Hive. Key Hive Use Cases. RDBMS / MPP Offload

moanna
Download Presentation

Stinger Initiative: Deep Dive

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Stinger Initiative: Deep Dive Interactive Query on Hadoop Chris Harris E-Mail : charris@hortonworks.com Twitter : cj_harris5

  2. Agenda • Key Hive Use Cases • Brief Refresher on Hive • The Stinger Initiative: Interactive Query for Hive

  3. Key Hive Use Cases • RDBMS / MPP Offload • More data under query. • Database unable to keep up with SLAs. • Analysis of semi-structured data. • ETL / Data Refinement • +++ Increasingly: Business Intelligence and interactive query

  4. BI Use Cases Enterprise Reports Dashboard / Scorecard Parameterized Reports Visualization Data Mining

  5. Organize Tiers and Process with Metadata Access Tier Pig Conform, Summarize, Access HiveQL Gold Tier • Organize data based on source/derived relationships • Allows for fault and rebuild process HCat Pig Transform, Integrate, Storage MapReduce Work Tier Provides unified metadata access to Pig, Hive & MapReduce Pig Standardize, Cleanse, Transform MapReduce Raw Tier WebHDFS Extract & Load Flume Sqoop

  6. Hive Current Focus Area Real-Time Interactive Non-Interactive Batch • Online systems • R-T analytics • CEP • Parameterized Reports • Drilldown • Visualization • Exploration • Data preparation • Incremental batch processing • Dashboards / Scorecards • Operational batch processing • Enterprise Reports • Data Mining Current Hive Sweet Spot 1m – 1h 1h+ 0-5s 5s – 1m Data Size

  7. Stinger: Extending Hive’s Sweetspot Real-Time Interactive Non-Interactive Batch Current Hive Sweet Spot • Online systems • R-T analytics • CEP • Parameterized Reports • Drilldown • Visualization • Exploration • Data preparation • Incremental batch processing • Dashboards / Scorecards • Operational batch processing • Enterprise Reports • Data Mining Future Hive Expansion 1m – 1h 1h+ 0-5s 5s – 1m Data Size • Improve Latency & Throughput • Query engine improvements • New “Optimized RCFile” column store • Next-gen runtime (elim’s M/R latency) • Extend Deep Analytical Ability • Analytics functions • Improved SQL coverage • Continued focus on core Hive use cases

  8. The top BI vendors support Hive today

  9. Agenda • Key Hive Use Cases • Brief Refresher on Hive • The Stinger Initiative: Interactive Query for Hive

  10. Brief Refresher on Hive The State of Hive Today (0.10)

  11. Hive’s Origins Hive was originally developed at Facebook. More data than existing RDBMS could handle. 60,000+ Hive queries per day. More than 1,000 users per day. 100+ PB of data. 15+ TB of data loaded daily. Hive is a proven solution at extreme scale.

  12. Hive 0.10 Capabilities • De-facto SQL Interface for Hadoop • Multiple persistence options: • Flat text for simple data imports. • Columnar format (RCFile) for high performance processing. • Secure and concurrent remote access • ODBC/JDBC connectivity • Highly extensible: • Supports User Defined Functions and User Defined Aggregation Functions. • Ships with more than 150 UDF/UDAF. • Extensible readers/writers can process any persisted data. • Support from 10+ BI vendors

  13. HDP 1.2: ODBC Access for Popular BI Tools • Seamless integration with BI tools such as Excel, PowerPivot, MicroStrategy, and Tableau • Efficiently maps advanced SQL functionality into HiveQL • With configurable pass-through of HiveQL for Hive-aware apps • ODBC 3.52 standard compliant • Supports Linux & Windows Applications & Spreadsheets Visualization & Intelligence ODBC Hortonworks Data Platform High quality ODBC driver developed in partnership with Simba. Free to download & use with Hortonworks Data Platform.

  14. 0 to Big Data in 15 Minutes Hands on tutorials integrated into Sandbox HDP environment for evaluation

  15. Agenda • Brief Refresher on Hive • Key Hive Use Cases • The Stinger Initiative: Interactive Query for Hive

  16. The Stinger Initiative Interactive Query on Hadoop

  17. Stinger Initiative: 2-Pronged Approach Making Hive Best for Interactive Query Improve Latency and Throughput Extend Deep Analytical Ability • Tez • New primitives move beyond map-reduce and beyond batch • Avoid unnecessary persistence of temporary data • Hive, Pig and others generate Tezplans for high perf • Query Engine Improvements • Cost-based optimizer • In-memory joins • Caching hot tables • Vector processing • State-of-the-art Column Store • “Optimized RCFile” or ORCFile • Minimizes disk IO and deserialization • TezService • Always-on service for query interactivity • Analytics Functions • SQL:2003 Compliant • OVER with PARTITION BY and ORDER BY • Wide variety of windowing functions: • RANK • LEAD/LAG • ROW_NUMBER • FIRST_VALUE • LAST_VALUE • Many more • Aligns well with BI ecosystem • Improved SQL Coverage • Non-correlated Subqueries using IN in WHERE • Expanded SQL types including DATETIME, VARCHAR, etc.

  18. Hive: Performance Improvements

  19. Stinger Initiative At A Glance

  20. Base Optimizations: Intelligent Optimizer • Introduction of In-Memory Hash Join: • For joins where one side fits in memory: • New in-memory-hash-join algorithm. • Hive reads the small table into a hash table. • Scans through the big file to produce the output. • Introduction of Sort-Merge-Bucket Join: • Applies when tables are bucketed on the same key. • Dramatic speed improvements seen in benchmarks. • Other Improvements: • Lower the footprint of the fact tables in memory. • Enable the optimizer to automatically pick map joins.

  21. Dimensionally Structured Data • Extremely common pattern in EDW. • Results in large “fact tables” and small “dimension tables”. • Dimension tables often small enough to fit in RAM. • Sometimes called Star Schema.

  22. A Query on Dimensional Data • Derived from TPC-DS Query 27 • Dramatic speedup on Hive 0.11 SELECT col5, avg(col6) FROM fact_table join dim1 on (fact_table.col1 = dim1.col1) join dim2 on (fact_table.col2 = dim2.col1) join dim3 on (fact_table.col3 = dim3.col1) join dim4 on (fact_table.col4 = dim4.col1) GROUP BY col5 ORDER BY col5 LIMIT 100;

  23. Star Schema Join Improvements in 0.11

  24. Hive: Bucketing • Bucketing causes Hive to physically co-locate rows within files. • Buckets can be sorted or unsorted. CREATE EXTERNAL TABLE IF NOT EXISTS test_table ( Id INT, name String ) PARTITIONED BY (dt STRING, hour STRING) CLUSTERED BY(country,continent) SORTED BY(country,continent) INTO n BUCKETS ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' LOCATION '/home/test_dir';

  25. ORCFile - Optimized Column Storage • Make a better columnar storage file • Tightly aligned to Hive data model • Decompose complex row types into primitive fields • Better compression and projection • Only read bytes from HDFS for the required columns. • Store column level aggregates in the files • Only need to read the file meta information for common queries • Stored both for file and each section of a file • Aggregates: min, max, sum, average, count • Allows fast access by sorted columns • Ability to add bloom filters for columns • Enables quick checks for whether a value is present

  26. Performance Futures - Vectorization • Operates on blocks of 1K or more records, rather than one record at a time • Each block contains an array of Java scalars, one for each column • Avoids many function calls, virtual dispatch, CPU pipeline stalls • Size to fit in L1 cache, avoid cache misses • Generate code for operators on the fly to avoid branches in code, maximize deep pipelines of modern processers • Up to 30x faster processing of records • Beta possible in 2H 2013

  27. Performance Futures – Cost-Based Optimizer • Generate more intelligent DAGs based on properties of data being queried, e.g. table size, statistics, histograms, etc.

  28. Performance Futures - Buffering • Query workloads always have hotspots: • Metadata • Small dimension tables • Build into YARN or Tez Service ways of buffering frequently used data into memory so it is not always read from disk. • Part of the “last mile” of latency efforts.

  29. Yarn Moving Hive and Hadoop beyond MapReduce

  30. Hadoop 2.0 Innovations - YARN • Focus on scale and innovation • Support 10,000+ computer clusters • Extensible to encourage innovation • Next generation execution • Improves MapReduce performance • Supports new frameworks beyond MapReduce • Low latency, Streaming, Services • Do more with a single Hadoop cluster Graph Processing Other MapReduce Tez YARN: Cluster Resource Management HDFS Redundant, Reliable Storage

  31. Tez Moving Hive and Hadoop beyond MapReduce

  32. Tez • Low level data-processing execution engine • Use it for the base of MapReduce, Hive, Pig, Cascading etc. • Enables pipelining of jobs • Removes task and job launch times • Hive and Pig jobs no longer need to move to the end of the queue between steps in the pipeline • Does not write intermediate output to HDFS • Much lighter disk and network usage • Built on YARN

  33. Tez - Core Idea Task with pluggable Input, Processor & Output Processor Output Input Task Tez Task - <Input, Processor, Output> YARN ApplicationMaster to run DAG of Tez Tasks

  34. Tez – Blocks for building tasks MapReduce ‘Map’ MapReduce ‘Reduce’ Map Processor Reduce Processor Reduce Processor Sorted Output HDFS Output Sorted Output HDFS Input Shuffle Input Shuffle Input Intermediate ‘Reduce’ for Map-Reduce-Reduce MapReduce ‘Map’ Task MapReduce ‘Reduce’ Task Intermediate ‘Reduce’ for Map-Reduce-Reduce

  35. Tez – More tasks Special Pig/Hive ‘Map’ In-memory Map Map Processor Map Processor Reduce Processor Pipeline Sorter Output In-memory Sorted Output Sorted Output HDFS Input HDFSInput Shuffle Skip-merge Input Tez Task Tez Task Tez Task Special Pig/Hive ‘Reduce’

  36. Pig/Hive-MR versus Pig/Hive-Tez SELECT a.state, COUNT(*), AVERAGE(c.price) FROM a JOIN bON (a.id = b.id) JOIN cON (a.itemId = c.itemId) GROUP BY a.state Job 1 Job 2 I/O Synchronization Barrier I/O Synchronization Barrier Single Job Job 3 Pig/Hive - Tez Pig/Hive - MR

  37. FastQuery: Beyond Batch with YARN Always-On Tez Service Low latency processing for all Hadoop data processing Tez Generalizes Map-Reduce Simplified execution plans process data more efficiently

  38. Tez Service • MR Query Startup Expensive • Job launch & task-launch latencies are fatal for short queries (in order of 5s to 30s) • Solution • Tez Service • Removes task-launch overhead • Removes job-launch overhead • Hive/Pig • Submit query-plan to Tez Service • Native Hadoop service, not ad-hoc

  39. Tez Service Delivers Low Latency SELECT a.state, COUNT(*), AVERAGE(c.price) FROM a JOIN b ON (a.id = b.id) JOIN c ON (a.itemId = c.itemId) GROUP BY a.state * Numbers for illustration only

  40. Recap and Questions: Hive Performance

  41. Improving Hive’s SQL Support

  42. Stinger: Deep Analytical Capabilities • SQL:2003 Window Functions • OVER clauses • Multiple PARTITION BY and ORDER BY supported • Windowing supported (ROWS PRECEDING/FOLLOWING) • Large variety of aggregates • RANK • FIRST_VALUE • LAST_VALUE • LEAD / LAG • Distrubutions

  43. Hive Data Type Conformance • Data Types: • Add fixed point NUMERIC and DECIMAL type (in progress) • Add VARCHAR and CHAR types with limited field size • Add DATETIME • Add size ranges from 1 to 53 for FLOAT • Add synonyms for compatibility • BLOB for BINARY • TEXT for STRING • REAL for FLOAT • SQL Semantics: • Sub-queries in IN, NOT IN, HAVING. • EXISTS and NOT EXISTS

  44. Questions?

  45. Thank You! Questions & Answers

More Related