Introduction to Hive

    Presentation Transcript
    1. Introduction to Hive Liyin Tang liyintan@usc.edu

    2. Outline
        • Motivation
        • Overview
        • Data Model / Metadata
        • Architecture
        • Performance
        • Cons and Pros
        • Application
        • Related Work
        3/10/2014

    3. Motivation
        [Figure: data pipeline diagram; component labels: Web Servers, Scribe MidTier, Scribe Writers, Realtime Hadoop Cluster, Hadoop Hive Warehouse, MySQL, Oracle RAC]
        http://hadoopblog.blogspot.com/2009/06/hdfs-scribe-integration.html

    4. Motivation
        • Limitations of MapReduce (MR)
            • Every job has to be expressed in the map/reduce model
            • Not reusable
            • Error-prone
            • For complex jobs:
                • Multiple stages of map/reduce functions
                • Like asking developers to write the specific physical execution plan for a database query by hand

    5. Overview
        • Intuition
            • Make unstructured data look like tables, regardless of how it is really laid out
            • SQL-based queries can be run directly against these tables
            • Generate a specific execution plan for each query
        • What is Hive?
            • A data warehousing system that stores structured data on the Hadoop file system
            • Provides an easy way to query this data by executing Hadoop MapReduce plans

    6. Data Model
        • Tables
            • Basic column types (int, float, boolean)
            • Complex types: List / Map (associative array)
        • Partitions
        • Buckets
        • CREATE TABLE sales (id INT, items ARRAY<STRUCT<id:INT, name:STRING>>) PARTITIONED BY (ds STRING) CLUSTERED BY (id) INTO 32 BUCKETS;
        • SELECT id FROM sales TABLESAMPLE (BUCKET 1 OUT OF 32);
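
The bucketing and sampling statements on this slide can be pictured with a small Python sketch. It is only an illustration under stated assumptions: a row's bucket is taken to be the key's hash modulo the bucket count, an integer key is assumed to hash to its own value, and the helper names (bucket_of, write_buckets, table_sample) are hypothetical and not part of Hive.

```python
from collections import defaultdict

NUM_BUCKETS = 32  # mirrors CLUSTERED BY (id) INTO 32 BUCKETS


def bucket_of(key, num_buckets=NUM_BUCKETS):
    """Return the 0-based bucket index for an integer key.

    Assumption: an integer key hashes to its own value.
    """
    return (key & 0x7FFFFFFF) % num_buckets


def write_buckets(rows, num_buckets=NUM_BUCKETS):
    """Group rows by bucket, as if each bucket were a separate file."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[bucket_of(row["id"], num_buckets)].append(row)
    return buckets


def table_sample(buckets, x, y, num_buckets=NUM_BUCKETS):
    """Mimic TABLESAMPLE (BUCKET x OUT OF y) for the case y == bucket count.

    Only bucket x (1-based) is read; the other buckets are never scanned.
    """
    if y != num_buckets:
        raise ValueError("this sketch only covers y equal to the bucket count")
    return buckets.get(x - 1, [])


# Made-up rows with ids 0..99.
rows = [{"id": i} for i in range(100)]
buckets = write_buckets(rows)
sample_ids = [row["id"] for row in table_sample(buckets, 1, 32)]
print(sample_ids)
```

The point of the design is that a sample touches one bucket instead of the whole table, so the cost of the sampled query scales with the bucket size.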

    7. Metadata
        • Database namespace
        • Table definitions
            • Schema info, physical location in HDFS
        • Partition data
        • ORM framework
            • All metadata is stored in Derby by default
            • Any database with a JDBC driver can be configured instead

    8. Architecture
        [Figure: architecture diagram; surviving labels: Map Reduce, HDFS]
        http://www.slideshare.net/cloudera/hw09-hadoop-development-at-facebook-hive-and-hdfs

    9. Performance
        • GROUP BY operation
        • Efficient execution plans based on:
            • Data skew:
                • How evenly the data is distributed across the physical nodes
                • Bottleneck vs. load balance
            • Partial aggregation:
                • Group rows with the same GROUP BY value as early as possible
                • In-memory hash table in the mapper
                • Runs earlier than the combiner
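
Partial aggregation as described on this slide can be sketched in Python. This is a toy model, not Hive's implementation: each "mapper" is a plain function over a list of keys, the in-memory hash table is a Counter, and the function names and input splits are made up for illustration.

```python
from collections import Counter


def map_partial_count(split):
    """One mapper: aggregate its split in an in-memory hash table,
    then emit one (key, partial_count) pair per distinct key."""
    table = Counter()
    for key in split:
        table[key] += 1
    return list(table.items())


def reduce_merge(partials):
    """Reducer side: merge the partial counts coming from all mappers."""
    total = Counter()
    for partial in partials:
        for key, count in partial:
            total[key] += count
    return dict(total)


# Two made-up input splits holding the GROUP BY column values.
splits = [["a", "b", "a", "a"], ["b", "b", "c"]]
partials = [map_partial_count(split) for split in splits]

raw_rows = sum(len(split) for split in splits)         # rows read by mappers
emitted_rows = sum(len(partial) for partial in partials)  # rows sent to reducers
result = reduce_merge(partials)
print(raw_rows, emitted_rows, result)
```

Seven input rows shrink to four emitted pairs here; the saving grows with the number of repeated keys per split.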

    10. Performance
        • JOIN operation
            • Traditional map-reduce join
            • Early map-side join
                • Very efficient for joining a small table with a large table
                • Keep the smaller table's data in memory first
                • Join with one chunk of the larger table's data at a time
                • Trades space complexity for time complexity
        7/20/2010
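
The map-side join idea above can be sketched as a hash join in Python. This is a minimal sketch under assumptions: tables are lists of dicts, the large table arrives in chunks, and the table contents and function names are invented for the example rather than taken from Hive.

```python
from collections import defaultdict


def build_hash_table(small_table, key):
    """Load the whole small table into memory, indexed by the join key."""
    table = defaultdict(list)
    for row in small_table:
        table[row[key]].append(row)
    return table


def map_side_join(large_chunks, small_table, key):
    """Stream the large table chunk by chunk and probe the in-memory table.

    No shuffle or reduce phase is needed because every mapper holds
    a full copy of the small table.
    """
    lookup = build_hash_table(small_table, key)
    for chunk in large_chunks:
        for big_row in chunk:
            for small_row in lookup.get(big_row[key], []):
                yield {**small_row, **big_row}


# Made-up data: a small dimension table and a chunked large table.
users = [{"uid": 1, "name": "ann"}, {"uid": 2, "name": "bob"}]
click_chunks = [
    [{"uid": 1, "url": "/a"}, {"uid": 3, "url": "/b"}],
    [{"uid": 2, "url": "/c"}, {"uid": 1, "url": "/d"}],
]
joined = list(map_side_join(click_chunks, users, "uid"))
print(joined)
```

The memory cost is the size of the small table per mapper, which is the space-for-time trade the slide refers to.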

    11. Performance
        • SerDe (serializer/deserializer)
            • Describes how to load data from a file into a representation that makes it look like a table
        • Lazy loading
            • Create a field object only when necessary
            • Reduces the overhead of creating unnecessary objects in Hive
            • Creating objects is expensive in Java
            • Increases performance
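
Lazy field creation can be illustrated with a short Python sketch. It is a simplification and not Hive's SerDe code: the LazyRow class, its schema format, and the objects_created counter are hypothetical, and the raw line is split eagerly on first access purely to keep the example short.

```python
class LazyRow:
    """Holds a raw delimited line and converts a field only when it is read."""

    def __init__(self, raw, schema, delimiter="\t"):
        self._raw = raw
        self._schema = schema      # list of (column name, converter) pairs
        self._delimiter = delimiter
        self._parts = None         # raw text pieces, filled on first access
        self._cache = {}           # converted values, filled per column
        self.objects_created = 0   # counts conversions actually performed

    def get(self, name):
        if name in self._cache:
            return self._cache[name]
        if self._parts is None:
            self._parts = self._raw.split(self._delimiter)
        for index, (column, convert) in enumerate(self._schema):
            if column == name:
                value = convert(self._parts[index])
                self._cache[name] = value
                self.objects_created += 1
                return value
        raise KeyError(name)


schema = [("freq", int), ("word", str)]
row = LazyRow("120\thamlet", schema)
freq = row.get("freq")        # converts only the freq column
freq_again = row.get("freq")  # served from the cache, no new conversion
print(freq, row.objects_created)
```

A query that reads one column out of many therefore pays for one conversion per row instead of one per field.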

    12. Hive – Performance
        • QueryA: SELECT count(1) FROM t;
        • QueryB: SELECT concat(concat(concat(a,b),c),d) FROM t;
        • QueryC: SELECT * FROM t;
        • Map-side time only (incl. GzipCodec for compression/decompression)
        • * These two features need to be tested with other queries.
        http://www.slideshare.net/cloudera/hw09-hadoop-development-at-facebook-hive-and-hdfs

    13. Pros
        • An easy way to process large-scale data
        • Supports SQL-based queries
        • Provides more user-defined interfaces for extension
        • Programmability
        • Efficient execution plans for performance
        • Interoperability with other database tools

    14. Cons
        • No easy way to append data
            • Files in HDFS are immutable
        • Future work
            • Views / variables
            • More operators
            • IN / EXISTS semantics
            • More future work is listed on the mailing list

    15. Application
        • Log processing
        • Daily reports
        • User activity measurement
        • Data/text mining
        • Machine learning (training data)
        • Business intelligence
        • Advertising delivery
        • Spam detection

    16. Related Work
        • Parallel databases: Gamma, Bubba, Volcano
        • Google: Sawzall
        • Yahoo: Pig
        • IBM: JAQL
        • Microsoft: DryadLINQ, SCOPE

    17. References
        • [1] A. Thusoo et al. Hive: a warehousing solution over a map-reduce framework. Proceedings of VLDB '09, 2009.
        • [2] Hadoop 2009: http://www.slideshare.net/cloudera/hw09-hadoop-development-at-facebook-hive-and-hdfs
        • [3] Cloudera: http://www.cloudera.com/videos/introduction_to_hive
        • [4] Facebook Data Team: http://www.slideshare.net/zshao/hive-data-warehousing-analytics-on-hadoop-presentation

    18. Q & A. Thank you.

    19. Back up

    20. Hive Components
        • Shell interface: like the MySQL shell
        • Driver:
            • Session handles, fetch, execution
        • Compiler:
            • Parse, plan, optimize
        • Execution engine:
            • DAG of stages
            • Runs map or reduce tasks
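
The execution engine's "DAG of stages" can be pictured with a small Python sketch that runs stages in dependency order. The stage names and the run_plan helper are hypothetical, the plan itself is made up, and the sketch relies on graphlib from the Python standard library (3.9 or later); it says nothing about how Hive schedules stages internally.

```python
from graphlib import TopologicalSorter

# Hypothetical plan: each stage maps to the set of stages it depends on.
plan = {
    "scan_a": set(),
    "scan_b": set(),
    "join": {"scan_a", "scan_b"},
    "group_by": {"join"},
}


def run_plan(stages, run_stage):
    """Run every stage after all of its dependencies have finished.

    Returns the order in which stages were executed.
    """
    executed = []
    for name in TopologicalSorter(stages).static_order():
        run_stage(name)  # in a real engine this would launch a map or reduce job
        executed.append(name)
    return executed


executed = run_plan(plan, lambda name: None)
print(executed)
```

The two scans have no dependency on each other, so their relative order is not fixed; only the constraint that the join follows both, and the aggregation follows the join, is guaranteed.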

    21. Motivation
        • MapReduce motivation
            • Data processing: > 1 TB
            • Massively parallel
            • Locality
            • Fault tolerant

    22. Hive Usage
        • hive> show tables;
        • hive> create table SHAKESPEARE (freq INT, word STRING) row format delimited fields terminated by '\t' stored as textfile;
        • hive> load data inpath "shakespeare_freq" into table shakespeare;

    23. Hive Usage
        • hive> load data inpath "shakespeare_freq" into table shakespeare;
        • hive> select * from shakespeare where freq > 100 sort by freq asc limit 10;
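
For readers who want to see what the query above computes, here is a rough Python equivalent over tab-delimited lines. The sample lines are made up, the function names are hypothetical, and the sketch mirrors the single-reducer case only: in Hive, SORT BY guarantees ordering within each reducer rather than across the whole output.

```python
def parse_line(line):
    """Split one tab-delimited line into (freq, word), matching the table schema."""
    freq, word = line.split("\t")
    return int(freq), word


def frequent_words(lines, min_freq=100, limit=10):
    """Filter on freq > min_freq, sort ascending by freq, keep the first rows."""
    rows = [parse_line(line) for line in lines]
    kept = [row for row in rows if row[0] > min_freq]
    kept.sort(key=lambda row: row[0])
    return kept[:limit]


# Made-up stand-in for the contents of the shakespeare_freq file.
sample_lines = ["25\tthe", "150\tking", "101\tlove", "100\tlord", "300\tthou"]
result = frequent_words(sample_lines)
print(result)
```

Note that the row with freq equal to 100 is dropped because the predicate is strictly greater than 100.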

    24. Hive Usage @ Facebook
        • Statistics per day:
            • 4 TB of compressed new data added per day
            • 135 TB of compressed data scanned per day
            • 7500+ Hive jobs per day
        • Hive simplifies Hadoop:
            • ~200 people/month run jobs on Hadoop/Hive
            • Analysts (non-engineers) use Hadoop through Hive
            • 95% of jobs are Hive jobs
        http://www.slideshare.net/cloudera/hw09-hadoop-development-at-facebook-hive-and-hdfs