1 / 12

HIVE

HIVE. Bucharest Big Data Meetup March 10, 2015. w hoami. Developer with SQL Server team since 2001 Apache contributor Hive Hadoop core (security) s tackoverflow user 105929s @ rusanu. What is Hive. Datawarehouse for querying and managing large datasets

makedar
Download Presentation

HIVE

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. HIVE Bucharest Big Data Meetup March 10, 2015

  2. whoami • Developer with SQL Server team since 2001 • Apache contributor • Hive • Hadoop core (security) • stackoverflow user 105929s • @rusanu

  3. What is Hive • Datawarehouse for querying and managing large datasets • A query engine that use Hadoop MapReduce for execution • A SQL abstraction for creating MapReduce algorithms • SQL interface to HDFS data • Developed at FacebookVLDB 2009: Hive - A Warehousing Solution Over a Map-Reduce Framework • ASF top project since September 2010

  4. How does Hive work Beeswax JDBC Shell ODBC • SQL submitted via CLI or Hiveserver(2) • Metadata describing tables stored in RDBMS • Driver compiles/optimizes execution plan • Plan and execution engine submitted to Hadoop as job • MR invokes Hive execution engine which executes plan HCatalog Hive CLI Hiveserver2 Metastore RDBMS Driver Compiles, Optimizes Hadoop MapReduce HDFS Job Tracker Task Split Task Split

  5. Hive Query execution • Compilation/Optimization results in an AST containing operators eg: • FetchOperator: scans source data (the input split) • SelectOperator: projects column values, computes • GroupByOperator: aggregate functions (SUM, COUNT etc) • JoinOperator:joins • The plan forms a DAG of MR jobs • The plan tree is serialized (Kryo) • Hive Driver dispatches jobs • Multiple stages can result in multiple jobs • Task execution picks up the plan and start iterating the plan • MR emits values (rows) into the topmost operator (Fetch) • Rows propagate down the tree • ReduceSinkOperator emits map output for shuffle • Each operator implements both a map side and a reduce side algorithm • Executes the one appropriate for the current task • MR does the shuffle, many operators rely on it as part of their algorithm • Eg. SortOperator, GroupByOperator • Multi-stage queries create intermediate output and the driver submits new job to continue next stage • TEZ execution: map-reduce-reduce, usually eliminates multiple stages (more later) • Vectorized execution mode emits batches of rows (1024 rows)

  6. Hive features • Data types: • Numeric: tinyint, smallint, int, bigint, float, double, decimal(precision, scale) • Date/Time: timestamp, date • Character types: string, char(size), varchar(size) • Misc. types: Boolean, binary • Complex types: ARRAY<type>, MAP<type, type>, STRUCT<name:type, name:type>, UNIONTYPE<type, type, type> • Storage formats: text, sequencefile, ORC, Parquet, RC, arbitrary SerDe • Data Load: INSERT, LOAD, external tables, dynamic partitioning • Bucketized tables • JOIN optimizations: MapJoin, SMB join • ACID DML (INSERT/UPDATE/DELETE) • Only supported for ORC storage for now • Columnar storage, vectorized execution • Cost based optimizer (new) • HiveQL: SQL dialect, drives toward ANSI-92 compliance • Subqueries, joins, common table expressions • Lateral views (CROSS APPLY) • SELECT … FROM table LATERAL VIEW explode(column) • Windowing and analytical functions • LEAD, LAG, FIRST_VALUE, LAST_VALUE • RANK, ROW_NUMBER, DENSE_RANK, PERCENT_RANK, NTILE • OVER clause for aggregates • PARTITION BY, ORDER BY • WINDOW specification • SELECT SUM(a) OVER (PARTITION BY b ORDER BY c ROWS 3 PRECEDING AND 3 FOLLOWING) • WINDOW clause • SELECT SUM(b) OVER w FROM t WINDOW w AS (PARTITION BY b ORDER BY c ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING) • GROUPING SETs, CUBE, ROLLUP • XPath, custom UDF • TRANSFORM: arbitrary map_script, reduce_script

  7. Hive engines • MapReduce • Default, widely available • Complex queries require stages -> stop-and-go • Always on disk shuffle • TEZ • Generalized MRR (DAG) • Pipelining • Memory shuffle • JOINs (bipartite) • Custom sort • HIVE can optimize plans for TEZ • Recommended engine • SPARK • HIVE-7292 • In development • Not to be confused with Shark or Spark-SQL

  8. Hive pros and cons • Capable of handling PB scale • Decent performance • Fairly advanced SQL features • Integrates with Hadoop ecosystem • Share the data (HDFS) • Leverage existing clusters • High Availability • Disaster Recoverability • Partitioning, Clustering, Bucketing • Positive momentum, active development • No licensing costs • Not good for ad-hoc due to high latency (job submit time) • Topic is actively pursued by Hive/Tez/Yarn • ANSI-SQL gaps • Poor toolset (hivecli, ODBC drivers) • In-house RDBMS operating expertize does not translate to Hive • Outperformed by (costly) high-end proprietary solutions

  9. How to use Hive • As a part of the ETL in transforming large data ingress (click-stream, mobile uploads, access log etc) into query able form • Alternative to PIG for those that favor SQL • Run one-off queries to analyze large unstructured data sets • Power of SQL to get insight into ‘collect everything’ • DW/BI • Can also be part of the ETL that loads the DW • Deploy on TEZ not on M/R • Use ORC or as storage Parquet format • Use recent releases (Hive 0.14 or later)

  10. When to avoid Hive • Replace RDBMS/OLTP • Ad-hoc BI (latency still too high, will improve soon) • When the dataset is small • 512 GB RAM is cheap • If it fits in memory, is not Big data • When data changes frequently • If you have an infinite budget

  11. Some alternatives to Hive • Columnar storage in a traditional RDBMS • MySQL: ICE, InfiniDB • PostgreSQL: cstore_fwd (Citus) • SQL Server columnstore • Amazon Red-shift • Azure SQL Database v12 • Impala • Presto • Spark SQL • Note that Impala and Spark SQL can share Hive’s metastore

  12. Links • Hive Language Manual: https://cwiki.apache.org/confluence/display/Hive/LanguageManual • Join strategies in Hive: https://cwiki.apache.org/confluence/download/attachments/27362054/Hive+Summit+2011-join.pdf • Hive on Tez: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Tez • Hive on Spark: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark

More Related