Introduction to Hive

PowerPoint Slideshow about 'Introduction to Hive' - Kelvin_Ajay


Presentation Transcript
Outline
  • Motivation
  • Overview
  • Data Model / Metadata
  • Architecture
  • Performance
  • Cons and Pros
  • Application
  • Related Work

3/10/2014

Motivation

[Figure: data flow diagram; surviving labels: Realtime, Hadoop Cluster, Scribe MidTier, Web Servers, Scribe Writers, MySQL, Oracle RAC, Hadoop Hive Warehouse]

http://hadoopblog.blogspot.com/2009/06/hdfs-scribe-integration.html


Motivation
  • Limitations of MapReduce
    • Every job must be expressed in the Map/Reduce model
    • Code is not reusable
    • Error-prone
    • For complex jobs:
      • Multiple stages of Map/Reduce functions are needed
      • It is like asking developers to hand-write the physical execution plan for a database query


Overview
  • Intuitive
    • Makes unstructured data look like tables, regardless of how it is actually laid out
    • SQL-based queries can be run directly against these tables
    • Generates a specific execution plan for each query
  • What is Hive?
    • A data warehousing system that stores structured data on the Hadoop file system
    • Provides an easy way to query this data by executing Hadoop MapReduce plans


Data Model
  • Tables
    • Basic type columns (int, float, boolean)
    • Complex types: List / Map (associative array)
  • Partitions
  • Buckets
  • CREATE TABLE sales( id INT, items ARRAY<STRUCT<id:INT,name:STRING>>
    ) PARTITIONED BY (ds STRING)
    CLUSTERED BY (id) INTO 32 BUCKETS;

  • SELECT id FROM sales TABLESAMPLE (BUCKET 1 OUT OF 32)
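
The bucketing and sampling idea above can be illustrated with a minimal Python sketch. It is not Hive's implementation: the function names `bucket_for` and `table_sample` are hypothetical, and it assumes that an integer clustering key hashes to its own non-negative value and that sampling is done on the clustering column `id`.

```python
NUM_BUCKETS = 32  # mirrors "INTO 32 BUCKETS" in the table definition


def bucket_for(row_id: int, num_buckets: int = NUM_BUCKETS) -> int:
    """Return a 1-based bucket number for an integer clustering key.

    Assumption: the hash of an integer key is the integer itself,
    masked to be non-negative, then taken modulo the bucket count.
    """
    return (row_id & 0x7FFFFFFF) % num_buckets + 1


def table_sample(rows, bucket: int, out_of: int):
    """Keep only the rows whose key falls into the requested bucket."""
    return [row for row in rows if bucket_for(row["id"], out_of) == bucket]


# Toy "sales" table with ids 0..99.
rows = [{"id": i} for i in range(100)]

# Rough analogue of: SELECT id FROM sales TABLESAMPLE (BUCKET 1 OUT OF 32)
sampled_ids = [row["id"] for row in table_sample(rows, bucket=1, out_of=32)]
print(sampled_ids)
```

Because rows are assigned to buckets by key, a sample of one bucket only needs to read about 1/32 of the data instead of scanning the whole table.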


Metadata
  • Database namespace
  • Table definitions
    • Schema info and physical location in HDFS
  • Partition data
  • ORM Framework
    • All metadata is stored in Derby by default
    • Any database with a JDBC driver can be configured instead


Architecture

[Figure: Hive architecture diagram; surviving labels: Map Reduce, HDFS]

http://www.slideshare.net/cloudera/hw09-hadoop-development-at-facebook-hive-and-hdfs

Performance
  • GROUP BY operation
    • Efficient execution plans based on:
      • Data skew:
        • How evenly the data is distributed across the physical nodes
        • Bottleneck vs. load balance
      • Partial aggregation:
        • Group rows with the same GROUP BY value as early as possible
        • In-memory hash table in the mapper
        • Happens earlier than the combiner
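
Partial aggregation can be sketched in a few lines of Python. This is an illustration of the idea only, not Hive's code: the function names, the `max_entries` flush threshold, and the COUNT-style aggregate are all assumptions made for the example.

```python
from collections import defaultdict


def map_with_partial_aggregation(rows, key_field, max_entries=1000):
    """Mapper side: pre-aggregate counts per key in an in-memory hash table.

    The table is flushed whenever it grows beyond max_entries (a
    hypothetical memory limit), so the mapper emits one record per
    distinct key per flush instead of one record per input row.
    """
    table = defaultdict(int)
    for row in rows:
        table[row[key_field]] += 1
        if len(table) > max_entries:
            yield from list(table.items())
            table.clear()
    yield from list(table.items())


def reduce_counts(partials):
    """Reducer side: merge the partial counts coming from all mappers."""
    totals = defaultdict(int)
    for key, count in partials:
        totals[key] += count
    return dict(totals)


# Two input splits, each handled by its own mapper.
split_a = [{"ds": "2010-07-20"}, {"ds": "2010-07-20"}, {"ds": "2010-07-21"}]
split_b = [{"ds": "2010-07-21"}, {"ds": "2010-07-21"}]

partials_a = list(map_with_partial_aggregation(split_a, "ds"))
partials_b = list(map_with_partial_aggregation(split_b, "ds"))
totals = reduce_counts(partials_a + partials_b)
print(totals)
```

Here five input rows shrink to three partial records before the shuffle, which is the saving the slide attributes to aggregating inside the mapper.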


Performance
  • JOIN operation
    • Traditional Map-Reduce join
    • Early map-side join
      • Very efficient for joining a small table with a large table
      • Keep the smaller table's data in memory first
      • Join it with one chunk of the larger table's data at a time
      • Trades space complexity for time complexity
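
The map-side join described above can be sketched as a hash join in Python. The table contents, column names, and the `map_side_join` function are hypothetical and chosen only for illustration; the sketch assumes an inner equi-join on a single key.

```python
def map_side_join(small_table, large_table_chunks, key):
    """Inner-join a small in-memory table against a large table read in chunks.

    The small table is loaded into a hash map once; each chunk of the
    large table is then probed against it, so no shuffle or reduce
    phase is needed for the join itself.
    """
    lookup = {}
    for row in small_table:
        lookup.setdefault(row[key], []).append(row)

    for chunk in large_table_chunks:
        for big_row in chunk:
            for small_row in lookup.get(big_row[key], []):
                yield {**small_row, **big_row}


# Small dimension table that fits in memory.
users = [{"uid": 1, "name": "ann"}, {"uid": 2, "name": "bob"}]

# Large fact table, delivered one chunk (for example, one split) at a time.
click_chunks = [
    [{"uid": 1, "url": "/a"}, {"uid": 3, "url": "/b"}],
    [{"uid": 2, "url": "/c"}],
]

joined = list(map_side_join(users, click_chunks, "uid"))
print(joined)
```

The memory cost is proportional to the small table only, which is why the approach suits the small-table-with-large-table case named on the slide.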

7/20/2010

Performance
  • Ser/De
    • Describes how to load data from a file into a representation that makes it look like a table
  • Lazy load
    • Create a field object only when it is needed
    • Reduces the overhead of creating unnecessary objects in Hive
    • Object creation is expensive in Java
    • Increases performance
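
Lazy loading of fields can be sketched as follows. This is a simplified Python illustration, not Hive's Ser/De interface: the `LazyRow` class, its `conversions` counter, and the tab-delimited `(freq, word)` layout are assumptions made for the example.

```python
class LazyRow:
    """Row wrapper that converts a field to a typed object only on first access."""

    def __init__(self, line, schema, delimiter="\t"):
        # Splitting into raw strings is cheap; typed objects are built later.
        self._raw = line.split(delimiter)
        self._schema = schema  # list of (column name, converter) pairs
        self._index = {name: i for i, (name, _) in enumerate(schema)}
        self._cache = {}
        self.conversions = 0  # counts how many typed objects were created

    def get(self, name):
        if name not in self._cache:
            position = self._index[name]
            converter = self._schema[position][1]
            self._cache[name] = converter(self._raw[position])
            self.conversions += 1
        return self._cache[name]


schema = [("freq", int), ("word", str)]
row = LazyRow("25\tthe", schema)

# A query that only touches "freq" never builds an object for "word".
freq = row.get("freq")
freq_again = row.get("freq")  # served from the cache, no new conversion
print(freq, row.conversions)
```

A query that filters on one column out of many therefore pays the conversion cost for that column alone.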


Hive – Performance
  • QueryA: SELECT count(1) FROM t;
  • QueryB: SELECT concat(concat(concat(a,b),c),d) FROM t;
  • QueryC: SELECT * FROM t;
  • map-side time only (incl. GzipCodec for comp/decompression)
  • * These two features need to be tested with other queries.

http://www.slideshare.net/cloudera/hw09-hadoop-development-at-facebook-hive-and-hdfs

Pros
  • Pros
    • An easy way to process large-scale data
    • Supports SQL-based queries
    • Provides more user-defined interfaces for extension
    • Programmability
    • Efficient execution plans for performance
    • Interoperability with other database tools


Cons
  • Cons
    • No easy way to append data
    • Files in HDFS are immutable
  • Future work
    • Views / Variables
    • More operators
      • IN/EXISTS semantics
    • More future work is listed on the mailing list


Application
  • Log processing
    • Daily Report
    • User Activity Measurement
  • Data/Text mining
    • Machine learning (Training Data)
  • Business intelligence
    • Advertising Delivery
    • Spam Detection


Related Work
  • Parallel databases: Gamma, Bubba, Volcano
  • Google: Sawzall
  • Yahoo: Pig
  • IBM: JAQL
  • Microsoft: DryadLINQ, SCOPE


Reference
  • [1] A. Thusoo et al. Hive: a warehousing solution over a map-reduce framework. Proceedings of VLDB '09, 2009.
  • [2] Hadoop 2009:
    • http://www.slideshare.net/cloudera/hw09-hadoop-development-at-facebook-hive-and-hdfs
  • [3] Cloudera:
    • http://www.cloudera.com/videos/introduction_to_hive
  • [4] Facebook Data Team:
    • http://www.slideshare.net/zshao/hive-data-warehousing-analytics-on-hadoop-presentation


Hive Components
  • Shell Interface: Like the MySQL shell
  • Driver:
    • Session handles, fetch, execution
  • Compiler:
    • Parse, plan, optimize
  • Execution Engine:
    • DAG of stages
    • Run map or reduce


Motivation
  • MapReduce Motivation
    • Data processing: > 1 TB
    • Massively parallel
    • Locality
    • Fault Tolerant


Hive Usage
  • hive> show tables;
  • hive> create table shakespeare (freq INT, word STRING) row format delimited fields terminated by '\t' stored as textfile;
  • hive> load data inpath 'shakespeare_freq' into table shakespeare;

Hive Usage
  • hive> select * from shakespeare where freq>100 sort by freq asc limit 10;

Hive Usage @ Facebook
  • Statistics per day:
    • 4 TB of compressed new data added per day
    • 135TB of compressed data scanned per day
    • 7500+ Hive jobs per day
  • Hive simplifies Hadoop:
    • ~200 people/month run jobs on Hadoop/Hive
    • Analysts (non-engineers) use Hadoop through Hive
    • 95% of jobs are Hive Jobs

http://www.slideshare.net/cloudera/hw09-hadoop-development-at-facebook-hive-and-hdfs
