How to make your map-reduce jobs perform as well as pig: Lessons from pig optimizations

Thejas Nair

Pig team @ Yahoo!

Apache Pig PMC member

http://pig.apache.org

What is Pig?

An engine that executes Pig Latin locally or on a Hadoop cluster.

Pig Latin, a high level data processing language.

Pig-latin-cup pic from http://www.flickr.com/photos/frippy/2507970530/

Pig Latin example

Users = load 'users' as (name, age);
Fltrd = filter Users by age >= 18 and age <= 25;
Pages = load 'pages' as (user, url);
Jnd = join Fltrd by name, Pages by user;

Comparison with MR in Java

1/20 the lines of code

1/16 the development time

What about performance?

Pig Compared to Map Reduce

Faster development time

Data flow versus programming logic

Many standard data operations (e.g. join) included

Manages all the details of connecting jobs and data flow

Copes with Hadoop version change issues

And, You Don’t Lose Power

UDFs can be used to load, evaluate, aggregate, and store data

External binaries can be invoked

Metadata is optional

Flexible data model

Nested data types

Explicit data flow programming

Pig performance

PigMix: Pig vs. MapReduce

Pig optimization principles

Versus RDBMS: there are no accurate models for the data, the operators, or the execution environment

Use available reliable info. Trust user choice.

Use rules that help in most cases

Rules based on runtime information

Logical Optimizations

Script: A = load; B = foreach; C = filter
  -> Parser -> Logical Plan: A -> B -> C
  -> Logical Optimizer -> Optimized L. Plan: A -> C -> B

Restructure given logical dataflow graph

  • Apply filter, project, limit early
  • Merge foreach, filter statements
  • Operator rewrites
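
The "apply filter early" rule can be sketched as a rewrite over a linear plan. This is a toy model in plain Python, not Pig's optimizer: the plan representation and the `push_filters_early` helper are hypothetical, and the sketch assumes a filter may move ahead of a foreach only when it reads columns that the foreach passes through unchanged.

```python
# Toy logical plan: a list of (operator, detail) tuples, executed left to right.
# For "foreach", detail maps each output column to the input column it copies,
# or to None when the column is computed. For "filter", detail is the set of
# columns the predicate reads. Both conventions are assumptions of this sketch.

def push_filters_early(plan):
    """Swap a filter with the foreach before it whenever that is safe."""
    plan = list(plan)
    changed = True
    while changed:
        changed = False
        for i in range(1, len(plan)):
            op, detail = plan[i]
            prev_op, prev_detail = plan[i - 1]
            if op == "filter" and prev_op == "foreach":
                # Safe only if every column the filter reads is a pass-through.
                if all(prev_detail.get(col) is not None for col in detail):
                    # Refer to the columns by their names before the foreach.
                    renamed = {prev_detail[col] for col in detail}
                    plan[i - 1], plan[i] = ("filter", renamed), plan[i - 1]
                    changed = True
    return plan


project = {"name": "name", "age": "age"}
plan = [("load", "users"), ("foreach", project), ("filter", {"age"})]
optimized = push_filters_early(plan)  # load -> filter -> foreach
```

A filter on a computed column stays where it is, because it cannot be evaluated before the foreach has produced that column.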
Physical Optimizations

Optimized L. Plan: X -> Y -> Z
  -> Translator -> Phy/MR plan: M(PX-PYm) R(PYr) -> M(Z)
  -> Optimizer -> Optimized Phy/MR Plan: M(PX-PYm) C(PYc) R(PYr) -> M(Z)

Physical plan: sequence of MR jobs having physical operators.

  • Built-in rules, e.g. use of the combiner
  • Specified in the query, e.g. join type
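
To see why a built-in combiner rule pays off, here is a minimal sketch of a count job simulated in plain Python. It does not use Hadoop or Pig APIs; `map_count`, `combine`, `reduce_counts`, and `run` are hypothetical names, and the number of shuffled records stands in for network traffic.

```python
from collections import defaultdict


def map_count(records):
    """Map phase: emit (name, 1) for every input record."""
    for name in records:
        yield name, 1


def combine(pairs):
    """Combiner: pre-aggregate the output of one map task before the shuffle."""
    partial = defaultdict(int)
    for key, value in pairs:
        partial[key] += value
    return list(partial.items())


def reduce_counts(pairs):
    """Reduce phase: sum all values per key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def run(splits, use_combiner):
    """Return the final counts and the number of records sent to the reducers."""
    shuffled = []
    for split in splits:
        out = list(map_count(split))
        if use_combiner:
            out = combine(out)
        shuffled.extend(out)
    return reduce_counts(shuffled), len(shuffled)


splits = [["fred", "fred", "jane"], ["fred", "jane", "jane"]]
plain = run(splits, use_combiner=False)
combined = run(splits, use_combiner=True)
```

Both runs produce the same counts; with the combiner, each map task ships at most one record per key instead of one per input row.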
Hash Join

Users = load 'users' as (name, age);
Pages = load 'pages' as (user, url);
Jnd = join Users by name, Pages by user;

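[Figure: Map 1 reads a block of Pages (block n) and Map 2 reads a block of Users (block m). Each map output record is tagged with its input, (1, user) or (2, name). Reducer 1 receives (1, fred), (2, fred), (2, fred); Reducer 2 receives (1, jane), (2, jane), (2, jane).]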
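
The default (reduce-side) join can be sketched as a small simulation: tag each record with its input, shuffle by join key, and pair the two sides in the reducer. This is plain Python under stated assumptions, not Pig's implementation; `map_tag`, `shuffle`, `reduce_join`, and `hash_join` are hypothetical helpers, and the tag numbering follows the argument order used here rather than the slide.

```python
from collections import defaultdict


def map_tag(tag, records, key_index):
    """Map phase: emit (join key, (input tag, record))."""
    for rec in records:
        yield rec[key_index], (tag, rec)


def shuffle(pairs, num_reducers):
    """Route every key to one reducer and group the values per key."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        partitions[hash(key) % num_reducers][key].append(value)
    return partitions


def reduce_join(groups):
    """Reduce phase: cross the records of input 1 and input 2 for each key."""
    for values in groups.values():
        left = [rec for tag, rec in values if tag == 1]
        right = [rec for tag, rec in values if tag == 2]
        for l in left:
            for r in right:
                yield l + r


def hash_join(users, pages, num_reducers=2):
    tagged = list(map_tag(1, users, 0)) + list(map_tag(2, pages, 0))
    joined = []
    for groups in shuffle(tagged, num_reducers):
        joined.extend(reduce_join(groups))
    return joined


users = [("fred", 20), ("jane", 23)]
pages = [("fred", "p1"), ("fred", "p2"), ("amy", "p3")]
result = sorted(hash_join(users, pages))
```

Every record of both inputs crosses the network, and all records for one key land on a single reducer, which is what the skewed, merge, and replicated variants try to avoid.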

Skew Join

Users = load 'users' as (name, age);
Pages = load 'pages' as (user, url);
Jnd = join Pages by user, Users by name using 'skewed';

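[Figure: as in the hash join, with an SP stage shown at each map. The Pages records for the skewed key fred are split across reducers, and the matching Users record is sent to both. Reducer 1 receives (1, fred, p1), (1, fred, p2), (2, fred); Reducer 2 receives (1, fred, p3), (1, fred, p4), (2, fred).]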
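
A minimal sketch of the idea in the figure: records of a skewed key from the large input are spread over several reducers, and the matching records of the other input are copied to each of them. This is a plain Python simulation, not Pig's implementation; `skew_join` and `partition_for` are hypothetical, and the set of skewed keys is assumed to be known in advance.

```python
from collections import defaultdict
from itertools import cycle
import zlib


def partition_for(key, num_reducers):
    """Deterministic stand-in for a hash partitioner."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def skew_join(pages, users, skewed_keys, num_reducers=2):
    """Return the joined records and the number of output rows per reducer."""
    # Each reducer keeps, per key, a (pages, users) pair of lists.
    reducers = [defaultdict(lambda: ([], [])) for _ in range(num_reducers)]
    round_robin = cycle(range(num_reducers))
    for page in pages:
        key = page[0]
        # Skewed keys are spread round-robin; other keys use the partitioner.
        r = next(round_robin) if key in skewed_keys else partition_for(key, num_reducers)
        reducers[r][key][0].append(page)
    for user in users:
        key = user[0]
        # Users with a skewed key are replicated to every reducer.
        targets = range(num_reducers) if key in skewed_keys else [partition_for(key, num_reducers)]
        for r in targets:
            reducers[r][key][1].append(user)
    joined, load = [], []
    for groups in reducers:
        out = [p + u for ps, us in groups.values() for p in ps for u in us]
        load.append(len(out))
        joined.extend(out)
    return joined, load


pages = [("fred", "p1"), ("fred", "p2"), ("fred", "p3"), ("fred", "p4"), ("jane", "p5")]
users = [("fred", 20), ("jane", 23)]
joined, load = skew_join(pages, users, skewed_keys={"fred"})
```

The join output is unchanged, but the work for fred is shared by both reducers instead of landing on one.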

Merge Join

Users = load 'users' as (name, age);
Pages = load 'pages' as (user, url);
Jnd = join Pages by user, Users by name using 'merge';

[Figure: both inputs are sorted on the join key. Map 1 reads the Pages split aaron…amr and merges it with Users (aaron … zach), starting at aaron. Map 2 reads the Pages split amy…barb and merges it with Users, starting at amy.]
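
The work of one map task in a merge join can be sketched as follows, assuming both inputs are already sorted on the join key. This is plain Python, not Pig's implementation; `merge_join_split` is a hypothetical helper, and the binary search stands in for whatever index is used to find the starting position in the second input.

```python
import bisect


def merge_join_split(pages_split, users_sorted):
    """Join one sorted split of pages against the sorted users input."""
    if not pages_split:
        return []
    user_keys = [u[0] for u in users_sorted]
    # Seek to the first user whose key is >= the first key of this split.
    j = bisect.bisect_left(user_keys, pages_split[0][0])
    joined = []
    i = 0
    while i < len(pages_split) and j < len(users_sorted):
        pkey, ukey = pages_split[i][0], users_sorted[j][0]
        if pkey < ukey:
            i += 1
        elif pkey > ukey:
            j += 1
        else:
            # Collect the run of users with this key, then pair it with
            # every page that has the same key.
            j_end = j
            while j_end < len(users_sorted) and users_sorted[j_end][0] == pkey:
                j_end += 1
            while i < len(pages_split) and pages_split[i][0] == pkey:
                joined.extend(pages_split[i] + u for u in users_sorted[j:j_end])
                i += 1
            j = j_end
    return joined


users_sorted = [("aaron", 30), ("amy", 25), ("barb", 41), ("zach", 19)]
split_1 = [("aaron", "p1"), ("aaron", "p2"), ("amr", "p3")]
split_2 = [("amy", "p4"), ("barb", "p5")]
out_1 = merge_join_split(split_1, users_sorted)
out_2 = merge_join_split(split_2, users_sorted)
```

No shuffle or reduce phase is involved; each map task only reads its own split plus the matching range of the other input.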

Replicated Join

Users = load 'users' as (name, age);
Pages = load 'pages' as (user, url);
Jnd = join Pages by user, Users by name using 'replicated';

[Figure: Map 1 reads the Pages split aaron…amr and Map 2 reads the Pages split amy…barb. Each map holds a full copy of Users (aaron … zach).]
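
A replicated join can be sketched as a map-only job in which every map task loads the small input into an in-memory lookup table. The code below is a plain Python simulation with hypothetical helper names (`build_lookup`, `map_task`, `replicated_join`); it assumes Users is small enough to fit in memory.

```python
from collections import defaultdict


def build_lookup(users):
    """Load the small input into a hash table keyed by the join key."""
    lookup = defaultdict(list)
    for user in users:
        lookup[user[0]].append(user)
    return lookup


def map_task(pages_split, lookup):
    """Probe the in-memory table for every record of the large input."""
    return [page + user for page in pages_split for user in lookup.get(page[0], [])]


def replicated_join(page_splits, users):
    joined = []
    for split in page_splits:
        # Every map task builds its own copy of the lookup table.
        joined.extend(map_task(split, build_lookup(users)))
    return joined


users = [("aaron", 30), ("amy", 25), ("zach", 19)]
page_splits = [[("aaron", "p1"), ("amr", "p2")], [("amy", "p3"), ("barb", "p4")]]
result = replicated_join(page_splits, users)
```

The large input never leaves the map phase, at the cost of reading the small input once per map task.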

Group/cogroup optimizations
  • On sorted and ‘collected’ data
  • grp = group Users by name using 'collected';

[Figure: the sorted input (labelled Pages on the slide), aaron … zach, is divided between two maps. Map 1 reads aaron, aaron, barney; Map 2 reads carol onward.]
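
When all records for a key sit next to each other inside one split, grouping can happen in the map task with no reduce phase. A minimal sketch in plain Python, with a hypothetical `collected_group` helper, assuming exactly that layout:

```python
from itertools import groupby


def collected_group(split):
    """Group adjacent records that share the same key (first field)."""
    return [(name, list(rows)) for name, rows in groupby(split, key=lambda row: row[0])]


split_1 = [("aaron", 1), ("aaron", 2), ("barney", 3)]
split_2 = [("carol", 4), ("zach", 5)]
groups_1 = collected_group(split_1)
groups_2 = collected_group(split_2)
```

If the same key appeared in two splits, or in two separate runs of one split, this approach would emit it twice, which is why the input layout matters.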

Multi-store script

A = load 'users' as (name, age, gender, city, state);
B = filter A by name is not null;
C1 = group B by (age, gender);
D1 = foreach C1 generate group, COUNT(B);
store D1 into 'bydemo';
C2 = group B by state;
D2 = foreach C2 generate group, COUNT(B);
store D2 into 'bystate';

[Figure: A: load -> B: filter, then two branches: C1: group -> eval udf -> store into 'bydemo', and C2: group -> eval udf -> store into 'bystate'.]

Multi-Store Map-Reduce Plan

map: filter -> split -> local rearrange (one per branch)

reduce: multiplex -> package (one per branch) -> foreach (one per branch)
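
The plan above runs both groupings in a single map-reduce job. A minimal sketch of that idea in plain Python: the map side reads and filters the input once and emits one tagged key per branch, and the reduce side routes each key back to its branch. The function names and the branch-tag encoding are assumptions of this sketch, not Pig internals.

```python
from collections import defaultdict


def map_phase(records):
    """Filter once, then emit one tagged key per branch."""
    for name, age, gender, city, state in records:
        if name is None:               # B: filter A by name is not null
            continue
        yield (0, (age, gender)), 1    # branch 0: group by (age, gender)
        yield (1, state), 1            # branch 1: group by state


def reduce_phase(pairs):
    """Route each tagged key to its branch and count per key."""
    outputs = {0: defaultdict(int), 1: defaultdict(int)}
    for (branch, key), value in pairs:
        outputs[branch][key] += value
    return dict(outputs[0]), dict(outputs[1])


records = [
    ("fred", 20, "m", "sf", "CA"),
    ("jane", 20, "f", "la", "CA"),
    (None, 30, "m", "ny", "NY"),
    ("amy", 20, "f", "ny", "NY"),
]
bydemo, bystate = reduce_phase(map_phase(records))
```

Written as two separate jobs, the same script would load and filter the input twice.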

Memory Management

Use disk if large objects don’t fit into memory

  • JVM memory limit larger than physical memory: very poor performance
  • Spilling on memory threshold notifications from the JVM: unreliable
  • Pre-set limit for large bags; custom spill logic for different bag types, e.g. the distinct bag
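
The pre-set limit approach can be sketched as a bag that keeps at most a fixed number of items in memory and writes full batches to temporary files. This is an illustration in plain Python, not Pig's bag implementation; the `SpillableBag` class and its `max_in_memory` parameter are hypothetical.

```python
import pickle
import tempfile


class SpillableBag:
    """A bag that spills to disk once a pre-set number of items is buffered."""

    def __init__(self, max_in_memory):
        self.max_in_memory = max_in_memory
        self.memory = []
        self.spill_files = []

    def add(self, item):
        self.memory.append(item)
        if len(self.memory) >= self.max_in_memory:
            self._spill()

    def _spill(self):
        # Write the buffered items to a temporary file and clear the buffer.
        spill = tempfile.TemporaryFile()
        pickle.dump(self.memory, spill)
        self.spill_files.append(spill)
        self.memory = []

    def __iter__(self):
        # Replay spilled batches in insertion order, then the in-memory tail.
        for spill in self.spill_files:
            spill.seek(0)
            yield from pickle.load(spill)
        yield from self.memory


bag = SpillableBag(max_in_memory=3)
for n in range(8):
    bag.add(n)
items = list(bag)
```

Counting items instead of waiting for a JVM-style low-memory notification makes the spill point predictable, which is the trade-off the last bullet describes.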
Other optimizations
  • Aggressive use of combiner, secondary sort
  • Lazy deserialization in loaders
  • Better serialization format
  • Faster regex lib, compiled pattern
  • Compression between MR jobs
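
Lazy deserialization means a loader converts a field only when an operator actually reads it. A minimal sketch in plain Python, with a hypothetical `LazyRecord` class that assumes tab-delimited text input and caches one converted value per field index:

```python
class LazyRecord:
    """Split the line eagerly, but convert each field only on first access."""

    def __init__(self, line, delimiter="\t"):
        self._raw = line.rstrip("\n").split(delimiter)
        self._parsed = {}
        self.parse_count = 0  # number of conversions actually performed

    def get(self, index, convert):
        if index not in self._parsed:
            self._parsed[index] = convert(self._raw[index])
            self.parse_count += 1
        return self._parsed[index]


rec = LazyRecord("fred\t20\t3.5\n")
age = rec.get(1, int)
age_again = rec.get(1, int)
```

A script that filters on one column and projects another never pays for converting the remaining fields.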
Future optimization work

Improve memory management

Join + group in single MR, if same keys used

Even better skew handling

Adaptive optimizations

Automated Hadoop tuning

Pig - fast and flexible

Pic courtesy http://www.flickr.com/photos/shutterbc/471935204/

More flexibility in 0.8, 0.9

  • UDFs in scripting languages (Python)
  • MR job as relation
  • Relation as scalar
  • Turing-complete Pig (0.9)
Further reading
  • Docs - http://pig.apache.org/docs/r0.7.0/
  • Papers and talks - http://wiki.apache.org/pig/PigTalksPapers
  • Training videos on vimeo.com (search 'hadoop pig')