Introduction to Hive

Liyin Tang
liyintan@usc.edu

Outline
• Motivation
• Overview
• Data Model / Metadata
• Architecture
• Performance
• Cons and Pros
• Application
• Related Work
10/20/2019

Motivation
[Figure: HDFS/Scribe integration diagram. Surviving labels: Web Servers, Scribe MidTier, Scribe Writers, Realtime Hadoop Cluster, Hadoop Hive Warehouse, MySQL, Oracle RAC]
Source: http://hadoopblog.blogspot.com/2009/06/hdfs-scribe-integration.html

Motivation
• Limitations of MapReduce (MR)
  • Everything has to be expressed in the M/R model
  • Not reusable
  • Error prone
  • For complex jobs:
    • Multiple stages of Map/Reduce functions
    • It is like asking developers to hand-write the physical execution plan for a database query
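To make the "multiple stages" point concrete, here is a minimal in-process sketch of what a developer writes by hand for a two-step aggregation. It is only an illustration: `run_stage`, the mapper/reducer names, and the single-machine shuffle are all hypothetical stand-ins, not Hadoop APIs.

```python
from collections import defaultdict


def run_stage(records, mapper, reducer):
    """Simulate one Map/Reduce stage in-process: map, shuffle by key, reduce."""
    shuffled = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            shuffled[key].append(value)
    output = []
    for key in sorted(shuffled):
        output.extend(reducer(key, shuffled[key]))
    return output


# Stage 1: count how often each word occurs.
def count_mapper(line):
    for word in line.split():
        yield word, 1


def count_reducer(word, ones):
    yield word, sum(ones)


# Stage 2: regroup the words by their frequency.
def invert_mapper(pair):
    word, freq = pair
    yield freq, word


def collect_reducer(freq, words):
    yield freq, sorted(words)


def words_by_frequency(lines):
    """Chain the two stages by hand, as a physical plan would."""
    stage1 = run_stage(lines, count_mapper, count_reducer)
    return run_stage(stage1, invert_mapper, collect_reducer)
```

Every additional grouping or join adds another hand-wired stage like these, which is the reuse and error-proneness problem the slide describes.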

Overview
• Intuition
  • Make unstructured data look like tables, regardless of how it is really laid out
  • SQL-based queries can be run directly against these tables
  • Generate a specific execution plan for each query
• What is Hive
  • A data warehousing system that stores structured data on the Hadoop file system
  • Provides an easy way to query this data by executing Hadoop MapReduce plans

Data Model
• Tables
  • Basic type columns (int, float, boolean)
  • Complex types: List / Map (associative array)
• Partitions
• Buckets

    CREATE TABLE sales (
      id INT,
      items ARRAY<STRUCT<id:INT, name:STRING>>
    )
    PARTITIONED BY (ds STRING)
    CLUSTERED BY (id) INTO 32 BUCKETS;

    SELECT id FROM sales TABLESAMPLE (BUCKET 1 OUT OF 32);
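The idea behind buckets and TABLESAMPLE can be sketched in a few lines of Python. This is a simplified model, assuming a row's bucket is its integer id modulo the bucket count; the function names are hypothetical and Hive's real hashing and file layout may differ.

```python
NUM_BUCKETS = 32


def bucket_of(row_id, num_buckets=NUM_BUCKETS):
    """Assumed rule: an integer key lands in bucket (key mod num_buckets)."""
    return row_id % num_buckets


def build_buckets(ids, num_buckets=NUM_BUCKETS):
    """Distribute row ids into buckets, like CLUSTERED BY (id) INTO N BUCKETS."""
    buckets = {b: [] for b in range(num_buckets)}
    for row_id in ids:
        buckets[bucket_of(row_id, num_buckets)].append(row_id)
    return buckets


def sample_bucket(buckets, bucket_number):
    """Read a single bucket; bucket_number is 1-based as in TABLESAMPLE."""
    return buckets[bucket_number - 1]
```

Because sampling reads one bucket instead of the whole table, a query like the TABLESAMPLE example only has to touch roughly 1/32 of the rows in this model.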

Metadata
• Database namespace
• Table definitions
  • Schema info, physical location in HDFS
• Partition data
• ORM framework
• All metadata is stored in Derby by default
• Any database with a JDBC driver can be configured instead

Architecture
[Figure: Hive architecture diagram. Surviving labels: Map Reduce, HDFS]
Source: http://www.slideshare.net/cloudera/hw09-hadoop-development-at-facebook-hive-and-hdfs

Performance
• GROUP BY operation
  • Efficient execution plans based on:
    • Data skew:
      • How evenly the data is distributed across the physical nodes
      • Bottleneck vs. load balance
    • Partial aggregation:
      • Group rows with the same GROUP BY value as early as possible
      • In-memory hash table in the mapper
      • Happens earlier than the combiner
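Partial aggregation in the mapper can be illustrated with a small sketch. It assumes a count-style GROUP BY and a hypothetical `max_entries` limit that forces the hash table to flush; it is not Hive's actual operator code.

```python
from collections import defaultdict


def map_with_partial_aggregation(keys, max_entries=1000):
    """Yield (key, partial_count) pairs from an in-memory hash table.

    When a new key arrives and the table is full, the table is flushed
    downstream and restarted, so memory use stays bounded.
    """
    table = {}
    for key in keys:
        if key not in table and len(table) >= max_entries:
            yield from table.items()
            table = {}
        table[key] = table.get(key, 0) + 1
    # Emit whatever is still buffered at the end of the input split.
    yield from table.items()


def reduce_counts(partials):
    """Merge partial counts for the same key into final totals."""
    totals = defaultdict(int)
    for key, count in partials:
        totals[key] += count
    return dict(totals)
```

The mapper emits at most one pair per distinct key per flush instead of one pair per row, which is where the saving in shuffled data comes from.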

Performance
• JOIN operation
  • Traditional Map-Reduce join
  • Early map-side join
    • Very efficient for joining a small table with a large table
    • Keep the smaller table's data in memory first
    • Join with one chunk of the larger table's data at a time
    • Trades space for time
7/20/2010
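A map-side join as described above can be sketched as a hash join: the small table is loaded into memory once, and the large table streams past it chunk by chunk. The tuple layout and function name are assumptions made for this illustration.

```python
from collections import defaultdict


def map_side_join(small_table, large_table_chunks):
    """Inner-join (key, value) rows without a shuffle or reduce phase.

    small_table: iterable of (key, value) pairs that fits in memory.
    large_table_chunks: iterable of chunks, each a list of (key, value) pairs.
    """
    # Build the in-memory hash table from the small table (the space cost).
    lookup = defaultdict(list)
    for key, value in small_table:
        lookup[key].append(value)

    # Stream the large table one chunk at a time and probe the hash table.
    for chunk in large_table_chunks:
        for key, large_value in chunk:
            for small_value in lookup.get(key, ()):
                yield key, small_value, large_value
```

For example, joining `[(1, "US"), (2, "CN")]` against user rows keyed by country id yields one output row per matching user, and rows whose key is missing from the small table are dropped.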

Performance
• SerDe (serializer/deserializer)
  • Describes how to load data from a file into a representation that makes it look like a table
• Lazy loading
  • Create a field object only when necessary
  • Reduces the overhead of creating unnecessary objects in Hive
  • Object creation is expensive in Java
  • Increases performance
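The lazy-loading idea can be shown with a toy row class that parses a field only on first access. `LazyRow` and its `parse_count` counter are hypothetical names for this sketch; Hive's own lazy SerDe is written in Java and works on raw bytes.

```python
class LazyRow:
    """A delimited text row whose typed field objects are created on demand."""

    def __init__(self, raw_line, column_types, delimiter="\t"):
        # Only the raw strings are kept up front; no typed objects yet.
        self._raw_fields = raw_line.split(delimiter)
        self._types = column_types
        self._cache = {}
        self.parse_count = 0  # number of fields actually converted

    def get(self, index):
        """Convert the field at `index` on first access, then reuse it."""
        if index not in self._cache:
            self._cache[index] = self._types[index](self._raw_fields[index])
            self.parse_count += 1
        return self._cache[index]
```

A query such as `SELECT freq FROM t` would only ever call `get` on one column, so the other columns of each row are never converted.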

Hive – Performance
• QueryA: SELECT count(1) FROM t;
• QueryB: SELECT concat(concat(concat(a,b),c),d) FROM t;
• QueryC: SELECT * FROM t;
• Map-side time only (incl. GzipCodec for compression/decompression)
• * These two features need to be tested with other queries.
Source: http://www.slideshare.net/cloudera/hw09-hadoop-development-at-facebook-hive-and-hdfs

Pros
• An easy way to process large-scale data
• Supports SQL-based queries
• Provides more user-defined interfaces for extension
• Programmability
• Efficient execution plans for performance
• Interoperability with other database tools

Cons
• No easy way to append data
  • Files in HDFS are immutable
• Future work
  • Views / variables
  • More operators
  • IN / EXISTS semantics
  • More future work is listed on the mailing list

Application
• Log processing
• Daily report
• User activity measurement
• Data/text mining
• Machine learning (training data)
• Business intelligence
• Advertising delivery
• Spam detection

Related Work
• Parallel databases: Gamma, Bubba, Volcano
• Google: Sawzall
• Yahoo: Pig
• IBM: JAQL
• Microsoft: DryadLINQ, SCOPE

Reference
• [1] A. Thusoo et al. Hive: a warehousing solution over a map-reduce framework. Proceedings of VLDB '09, 2009.
• [2] Hadoop 2009: http://www.slideshare.net/cloudera/hw09-hadoop-development-at-facebook-hive-and-hdfs
• [3] Cloudera: http://www.cloudera.com/videos/introduction_to_hive
• [4] Facebook Data Team: http://www.slideshare.net/zshao/hive-data-warehousing-analytics-on-hadoop-presentation

Q & A
Thank you

Back up

Hive Components
• Shell interface: like the MySQL shell
• Driver:
  • Session handles, fetch, execution
• Compiler:
  • Parse, plan, optimize
• Execution engine:
  • DAG of stages
  • Runs map or reduce
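How an execution engine might walk a DAG of stages can be sketched with Python's standard `graphlib` module (Python 3.9+). The stage names and the `run_plan` helper are hypothetical; the sketch only shows dependency-ordered execution, not Hive's real scheduler.

```python
from graphlib import TopologicalSorter


def run_plan(stage_dependencies, run_stage=None):
    """Run stages so that every stage starts after the stages it depends on.

    stage_dependencies maps a stage name to the set of stages it needs.
    run_stage is an optional callback invoked once per stage.
    Returns the order in which the stages were run.
    """
    executed = []
    for stage in TopologicalSorter(stage_dependencies).static_order():
        if run_stage is not None:
            run_stage(stage)  # e.g. submit one map or reduce job
        executed.append(stage)
    return executed


# Hypothetical three-stage plan: join first, then group, then sort.
plan = {
    "map_join": set(),
    "group_by": {"map_join"},
    "order_by": {"group_by"},
}
```

Calling `run_plan(plan)` visits the join stage first and the sort stage last, because each stage lists its predecessor as a dependency.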

Motivation
• MapReduce motivation
  • Data processing: > 1 TB
  • Massively parallel
  • Locality
  • Fault tolerant

Hive Usage
• hive> show tables;
• hive> create table SHAKESPEARE (freq INT, word STRING) row format delimited fields terminated by '\t' stored as textfile;
• hive> load data inpath "shakespeare_freq" into table shakespeare;

Hive Usage
• hive> select * from shakespeare where freq > 100 sort by freq asc limit 10;

Hive Usage @ Facebook
• Statistics per day:
  • 4 TB of compressed new data added per day
  • 135 TB of compressed data scanned per day
  • 7500+ Hive jobs per day
• Hive simplifies Hadoop:
  • ~200 people/month run jobs on Hadoop/Hive
  • Analysts (non-engineers) use Hadoop through Hive
  • 95% of jobs are Hive jobs
Source: http://www.slideshare.net/cloudera/hw09-hadoop-development-at-facebook-hive-and-hdfs
