Introduction to Hive - PowerPoint PPT Presentation


PowerPoint Slideshow about 'Introduction to Hive' - Kelvin_Ajay


Presentation Transcript

Introduction to Hive

Liyin Tang

[email protected]



Outline

  • Motivation

  • Overview

  • Data Model / Metadata

  • Architecture

  • Performance

  • Cons and Pros

  • Application

  • Related Work

3/10/2014



Motivation

[Figure: data-flow diagram. Surviving labels: Realtime, Hadoop Cluster, Scribe MidTier, Web Servers, Scribe Writers, MySQL, Oracle RAC, Hadoop Hive Warehouse. Source: http://hadoopblog.blogspot.com/2009/06/hdfs-scribe-integration.html]


Motivation

  • Limitations of MapReduce (MR)

    • Every job has to be expressed in the M/R model

    • Code is not reusable

    • Error-prone

    • For complex jobs:

      • Multiple stages of Map/Reduce functions

      • Like asking developers to hand-write the physical execution plan in a database


Overview

  • Intuition

    • Make unstructured data look like tables, regardless of how it is actually laid out

    • SQL-based queries can be run directly against these tables

    • Generate a specific execution plan for each query

  • What is Hive?

    • A data warehousing system for storing structured data on the Hadoop file system

    • Provides an easy way to query this data by executing Hadoop MapReduce plans


Data Model

  • Tables

    • Basic column types (int, float, boolean)

    • Complex types: List / Map (associative array)

  • Partitions

  • Buckets

  • CREATE TABLE sales (id INT, items ARRAY<STRUCT<id:INT, name:STRING>>)

    PARTITIONED BY (ds STRING)

    CLUSTERED BY (id) INTO 32 BUCKETS;

  • SELECT id FROM sales TABLESAMPLE (BUCKET 1 OUT OF 32);
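
The bucketing and sampling idea on this slide can be sketched in a few lines of Python. This is an illustration only: it assumes a row's bucket is a hash of the clustering column modulo the bucket count, and it uses the integer value itself as the hash. The names `bucket_of` and `table_sample` are hypothetical and are not part of Hive.

```python
# Sketch of hash bucketing, assuming bucket = hash(key) mod number of buckets.
# Using the integer itself as the hash is an assumption made for illustration.

NUM_BUCKETS = 32


def bucket_of(row_id: int, num_buckets: int = NUM_BUCKETS) -> int:
    """Return the 1-based bucket number for a clustering-column value."""
    return (row_id % num_buckets) + 1


def table_sample(rows, bucket: int, out_of: int = NUM_BUCKETS):
    """Mimic TABLESAMPLE (BUCKET bucket OUT OF out_of) by keeping one bucket."""
    return [row for row in rows if bucket_of(row["id"], out_of) == bucket]


rows = [{"id": i} for i in range(100)]
sample = table_sample(rows, bucket=1)
print([row["id"] for row in sample])
```

Because every row is assigned to exactly one bucket when the table is written, a sample of one bucket only needs to read roughly 1/32 of the data rather than scan the whole table.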


Metadata

  • Database namespace

  • Table definitions

    • Schema info, physical location in HDFS

  • Partition data

  • ORM framework

    • All metadata is stored in Derby by default

    • Any database with a JDBC driver can be configured instead

Architecture

[Figure: Hive architecture diagram. Surviving labels: Map Reduce, HDFS. Source: http://www.slideshare.net/cloudera/hw09-hadoop-development-at-facebook-hive-and-hdfs]



Performance

  • GROUP BY operation

    • Efficient execution plans based on:

      • Data skew:

        • How evenly the data is distributed across the physical nodes

        • Bottleneck vs. load balance

      • Partial aggregation:

        • Group rows with the same GROUP BY value as early as possible

        • In-memory hash table in the mapper

        • Runs earlier than the combiner
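
A minimal Python sketch of map-side partial aggregation for a query such as a count per GROUP BY key. It is not Hive's implementation; the function names `map_with_partial_aggregation` and `reduce_counts` and the sample data are assumptions made for illustration.

```python
from collections import defaultdict


def map_with_partial_aggregation(records):
    """Mapper: count rows per key in an in-memory hash table before emitting."""
    partial = defaultdict(int)
    for key in records:
        partial[key] += 1
    return dict(partial)


def reduce_counts(partials):
    """Reducer: merge the partial counts produced by every mapper."""
    total = defaultdict(int)
    for partial in partials:
        for key, count in partial.items():
            total[key] += count
    return dict(total)


# Two input splits, each handled by one (simulated) mapper.
splits = [["a", "b", "a", "a"], ["b", "a", "c"]]
partials = [map_with_partial_aggregation(split) for split in splits]
result = reduce_counts(partials)
print(result)
```

Each mapper emits one pair per distinct key instead of one pair per row, so less data crosses the network, and a heavily skewed key no longer sends all of its rows to a single reducer.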


Performance

  • JOIN operation

    • Traditional Map-Reduce join

    • Early map-side join

      • Very efficient for joining a small table with a large table

      • Load the smaller table into memory first

      • Join it with one chunk of the larger table at a time

      • Trades space complexity for time complexity
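
The map-side join described above can be sketched as an in-memory hash join. This is an illustrative Python sketch, not Hive's code; `map_side_join` and the `countries`/`visits` data are hypothetical.

```python
def map_side_join(small_table, large_table_chunks):
    """Join on the first field: hash the small table, stream the large one."""
    # Build an in-memory lookup from the small table (the space cost).
    lookup = {}
    for key, value in small_table:
        lookup.setdefault(key, []).append(value)
    # Stream the large table one chunk at a time; no shuffle or reduce phase.
    for chunk in large_table_chunks:
        for key, value in chunk:
            for small_value in lookup.get(key, []):
                yield (key, small_value, value)


countries = [(1, "US"), (2, "CN")]  # small table
visits = [  # large table, already split into chunks
    [(1, "page_a"), (2, "page_b")],
    [(1, "page_c"), (3, "page_d")],
]
joined = list(map_side_join(countries, visits))
print(joined)
```

The whole small table must fit in each mapper's memory; in exchange, the large table is read once and never has to be sorted or shuffled to reducers.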

7/20/2010



Performance

  • SerDe (serializer/deserializer)

    • Describes how to load data from a file into a representation that makes it look like a table

  • Lazy loading

    • Create a field object only when it is needed

    • Avoids the overhead of creating unnecessary objects in Hive

    • Object creation is expensive in Java

    • Improves performance
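
The lazy-loading idea can be sketched as a row object that converts a field to its typed value only on first access. This is a Python illustration under stated assumptions: the class `LazyRow` and the `parsed_count` counter are hypothetical, and Hive's actual SerDe classes are written in Java and work differently in detail.

```python
class LazyRow:
    """Hold a raw text row and convert each field only when it is read."""

    def __init__(self, line, types, delimiter="\t"):
        self._raw = line.split(delimiter)
        self._types = types
        self._cache = {}
        self.parsed_count = 0  # number of fields actually converted

    def get(self, index):
        """Return the typed value of one column, converting it on first use."""
        if index not in self._cache:
            self._cache[index] = self._types[index](self._raw[index])
            self.parsed_count += 1
        return self._cache[index]


row = LazyRow("42\thello\t3.5", [int, str, float])
print(row.get(0), row.parsed_count)
```

A query that touches only one column of a wide table then pays the conversion cost for that column alone.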

Hive – Performance

  • QueryA: SELECT count(1) FROM t;

  • QueryB: SELECT concat(concat(concat(a,b),c),d) FROM t;

  • QueryC: SELECT * FROM t;

  • Map-side time only (including GzipCodec for compression/decompression)

  • * These two features need to be tested with other queries.

    http://www.slideshare.net/cloudera/hw09-hadoop-development-at-facebook-hive-and-hdfs



Pros

  • Pros

    • An easy way to process large-scale data

    • Supports SQL-based queries

    • Provides more user-defined interfaces for extension

    • Programmability

    • Efficient execution plans for performance

    • Interoperability with other database tools


Cons

  • Cons

    • No easy way to append data

    • Files in HDFS are immutable

  • Future work

    • Views / variables

    • More operators

      • IN / EXISTS semantics

    • More future work is listed on the mailing list


Application

  • Log processing

    • Daily Report

    • User Activity Measurement

  • Data/Text mining

    • Machine learning (Training Data)

  • Business intelligence

    • Advertising Delivery

    • Spam Detection


Related Work

  • Parallel databases: Gamma, Bubba, Volcano

  • Google: Sawzall

  • Yahoo: Pig

  • IBM: JAQL

  • Microsoft: DryadLINQ, SCOPE


Reference

  • [1] A. Thusoo et al. Hive: A Warehousing Solution over a Map-Reduce Framework. Proceedings of VLDB '09, 2009.

  • [2] Hadoop 2009:

    • http://www.slideshare.net/cloudera/hw09-hadoop-development-at-facebook-hive-and-hdfs

  • [3] Cloudera:

    • http://www.cloudera.com/videos/introduction_to_hive

  • [4] Facebook Data Team:

    • http://www.slideshare.net/zshao/hive-data-warehousing-analytics-on-hadoop-presentation


Q & A

Thank you




Hive Components

  • Shell interface: like the MySQL shell

  • Driver:

    • Session handles, fetch, execution

  • Compiler:

    • Parse, plan, optimize

  • Execution engine:

    • DAG of stages

    • Runs map or reduce tasks
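
How an execution engine might walk a DAG of stages can be sketched with Python's standard `graphlib` module (Python 3.9 or later). This is only an illustration of dependency-ordered execution; the stage names, the `run_plan` helper, and the plan itself are hypothetical and do not come from Hive.

```python
from graphlib import TopologicalSorter


def run_plan(stages, dependencies):
    """Run each stage after the stages it depends on; return the run order."""
    order = []
    for name in TopologicalSorter(dependencies).static_order():
        stages[name]()  # a real engine would launch a map or reduce job here
        order.append(name)
    return order


log = []
stages = {
    "map_scan": lambda: log.append("scan input"),
    "reduce_group": lambda: log.append("group rows"),
    "reduce_sort": lambda: log.append("sort output"),
}
# Each stage maps to the set of stages that must finish before it starts.
dependencies = {
    "map_scan": set(),
    "reduce_group": {"map_scan"},
    "reduce_sort": {"reduce_group"},
}
order = run_plan(stages, dependencies)
print(order)
```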


Motivation

  • MapReduce Motivation

    • Data processing: > 1 TB

    • Massively parallel

    • Locality

    • Fault Tolerant


Hive Usage

  • hive> show tables;

  • hive> create table shakespeare (freq INT, word STRING) row format delimited fields terminated by '\t' stored as textfile;

  • hive> load data inpath "shakespeare_freq" into table shakespeare;



Hive Usage

  • hive> load data inpath "shakespeare_freq" into table shakespeare;

  • hive> select * from shakespeare where freq>100 sort by freq asc limit 10;



Hive Usage @ Facebook

  • Statistics per day:

    • 4 TB of compressed new data added per day

    • 135 TB of compressed data scanned per day

    • 7500+ Hive jobs run per day

  • Hive simplifies Hadoop:

    • ~200 people/month run jobs on Hadoop/Hive

    • Analysts (non-engineers) use Hadoop through Hive

    • 95% of jobs are Hive Jobs

      http://www.slideshare.net/cloudera/hw09-hadoop-development-at-facebook-hive-and-hdfs
