
How to make your map-reduce jobs perform as well as pig: Lessons from pig optimizations

Thejas Nair, pig team @ Yahoo!, Apache pig PMC member. http://pig.apache.org. What is Pig? An engine that executes Pig Latin locally or on a Hadoop cluster.


Presentation Transcript


  1. How to make your map-reduce jobs perform as well as pig: Lessons from pig optimizations. Thejas Nair, pig team @ Yahoo!, Apache pig PMC member. http://pig.apache.org

  2. What is Pig? An engine that executes Pig Latin locally or on a Hadoop cluster. Pig Latin is a high-level data processing language. (Pig-latin-cup pic from http://www.flickr.com/photos/frippy/2507970530/)

  3. Pig Latin example
     Users = load 'users' as (name, age);
     Fltrd = filter Users by age >= 18 and age <= 25;
     Pages = load 'pages' as (user, url);
     Jnd = join Fltrd by name, Pages by user;

  4. Comparison with MR in Java: 1/20 the lines of code, 1/16 the development time. What about performance?

  5. Pig Compared to Map Reduce
     • Faster development time
     • Data flow versus programming logic
     • Many standard data operations (e.g. join) included
     • Manages all the details of connecting jobs and data flow
     • Copes with Hadoop version change issues

  6. And, You Don’t Lose Power
     • UDFs can be used to load, evaluate, aggregate, and store data
     • External binaries can be invoked
     • Metadata is optional
     • Flexible data model
     • Nested data types
     • Explicit data flow programming

  7. Pig performance. PigMix: pig vs. map-reduce

  8. Pig optimization principles
     • vs RDBMS: accurate models for data, operators, and the execution environment are not available
     • Use available reliable info. Trust user choice.
     • Use rules that help in most cases
     • Rules based on runtime information

  9. Logical Optimizations
     Script: A = load; B = foreach; C = filter
     Parser produces Logical Plan: A -> B -> C
     Logical Optimizer produces Optimized L. Plan: A -> C -> B
     Restructure given logical dataflow graph:
     • Apply filter, project, limit early
     • Merge foreach, filter statements
     • Operator rewrites
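The reordering on slide 9 (A -> B -> C becoming A -> C -> B) can be illustrated with a small in-memory sketch. This is not Pig's optimizer: `run_plan`, the sample rows, and the two lambdas are hypothetical, and the rewrite is only valid here because the filter reads a field that the foreach leaves unchanged.

```python
def run_plan(rows, steps):
    """Run ('foreach', fn) and ('filter', predicate) steps in order.

    Returns the output rows and how many times a foreach function was called.
    """
    foreach_calls = 0
    for kind, fn in steps:
        if kind == "foreach":
            foreach_calls += len(rows)
            rows = [fn(row) for row in rows]
        else:
            rows = [row for row in rows if fn(row)]
    return rows, foreach_calls


users = [("fred", 17), ("jane", 22), ("amy", 30), ("bob", 12)]
to_upper = lambda row: (row[0].upper(), row[1])   # stands in for B = foreach
is_adult = lambda row: row[1] >= 18               # stands in for C = filter

# Plan as written in the script: A -> B -> C
out_original, calls_original = run_plan(users, [("foreach", to_upper), ("filter", is_adult)])
# Plan after pushing the filter up: A -> C -> B
out_optimized, calls_optimized = run_plan(users, [("filter", is_adult), ("foreach", to_upper)])
```

Both plans return the same rows, but the second one evaluates the foreach only on rows that survive the filter. A hand-written map-reduce job gets the same benefit by filtering and projecting as early as possible in the mapper.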

  10. Physical Optimizations
      Optimized L. Plan: X -> Y -> Z
      Translator produces Phy/MR plan: M(PX-PYm) R(PYr) -> M(Z)
      Optimizer produces Optimized Phy/MR plan: M(PX-PYm) C(PYc) R(PYr) -> M(Z)
      Physical plan: sequence of MR jobs having physical operators.
      • Built-in rules, e.g. use of combiner
      • Specified in query, e.g. join type
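The combiner stage that appears as C(PYc) on slide 10 can be sketched in plain Python. This is a simulation under assumptions, not Hadoop code: `map_phase`, `combine`, `reduce_phase`, and the two sample blocks are hypothetical, and the aggregate is a simple count, which can be computed from partial counts.

```python
from collections import Counter, defaultdict


def map_phase(block):
    # Emit (key, 1) for every record in one input block.
    return [(name, 1) for name in block]


def combine(pairs):
    # Pre-aggregate the output of a single map task before the shuffle.
    partial = Counter()
    for key, value in pairs:
        partial[key] += value
    return list(partial.items())


def reduce_phase(shuffled):
    # Sum all values that arrive for each key.
    totals = defaultdict(int)
    for key, value in shuffled:
        totals[key] += value
    return dict(totals)


blocks = [["fred", "fred", "jane"], ["fred", "jane", "jane", "jane"]]
without_combiner = [pair for block in blocks for pair in map_phase(block)]
with_combiner = [pair for block in blocks for pair in combine(map_phase(block))]
```

The reducer sees the same totals either way, but fewer records cross the shuffle when the combiner runs. The saving grows with the number of repeated keys per map task.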

  11. Hash Join
      Users = load 'users' as (name, age);
      Pages = load 'pages' as (user, url);
      Jnd = join Users by name, Pages by user;
      [Diagram: Map 1 reads Pages block n and emits (1, user); Map 2 reads Users block m and emits (2, name). Reducer 1 receives (1, fred), (2, fred), (2, fred); Reducer 2 receives (1, jane), (2, jane), (2, jane).]
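The tagged reduce-side join from slide 11 can be simulated in a few lines. This is a minimal sketch, not Pig's implementation: the function names and sample data are hypothetical, `zlib.crc32` stands in for the hash partitioner, and the choice of tag 1 for Users and tag 2 for Pages is an assumption made for this example.

```python
import zlib
from collections import defaultdict


def map_tagged(rows, key_index, tag):
    # Emit (join key, (input tag, row)) so the reducer can tell the inputs apart.
    return [(row[key_index], (tag, row)) for row in rows]


def shuffle(pairs, num_reducers):
    # Route every key to one reducer and group its values there.
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        target = zlib.crc32(key.encode("utf-8")) % num_reducers
        partitions[target][key].append(value)
    return partitions


def reduce_join(groups):
    # For each key, cross the rows tagged 1 with the rows tagged 2.
    joined = []
    for values in groups.values():
        left = [row for tag, row in values if tag == 1]
        right = [row for tag, row in values if tag == 2]
        joined.extend(l + r for l in left for r in right)
    return joined


def hash_join(users, pages, num_reducers=2):
    pairs = map_tagged(users, 0, 1) + map_tagged(pages, 0, 2)
    joined = []
    for groups in shuffle(pairs, num_reducers):
        joined.extend(reduce_join(groups))
    return sorted(joined)


users = [("fred", 20), ("jane", 24)]
pages = [("fred", "a.com"), ("fred", "b.com"), ("jane", "c.com"), ("zach", "d.com")]
result = hash_join(users, pages)
```

Every row of both inputs passes through the shuffle, which is why the specialized joins on the following slides try to avoid or rebalance it.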

  12. Skew Join
      Users = load 'users' as (name, age);
      Pages = load 'pages' as (user, url);
      Jnd = join Pages by user, Users by name using 'skewed';
      [Diagram: Map 1 reads Pages block n and emits (1, user) through SP; Map 2 reads Users block m and emits (2, name) through SP. Reducer 1 receives (1, fred, p1), (1, fred, p2), (2, fred); Reducer 2 receives (1, fred, p3), (1, fred, p4), (2, fred).]
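The idea in the slide 12 diagram, where the Pages rows for a heavy key are spread over several reducers and the matching Users row is copied to each of them, can be sketched as follows. This is an illustration under assumptions: `skew_join` and `stable_partition` are hypothetical names, the set of skewed keys is supplied by hand, and round-robin placement is just one possible way to spread a key.

```python
import zlib
from collections import defaultdict
from itertools import count


def stable_partition(key, num_reducers):
    # Deterministic stand-in for a hash partitioner.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def skew_join(pages, users, skewed_keys, num_reducers=2):
    # One dict per reducer: key -> ([pages rows], [users rows]).
    reducers = [defaultdict(lambda: ([], [])) for _ in range(num_reducers)]
    round_robin = count()
    for page in pages:
        key = page[0]
        if key in skewed_keys:
            # Spread the rows of a heavy key across all reducers.
            target = next(round_robin) % num_reducers
        else:
            target = stable_partition(key, num_reducers)
        reducers[target][key][0].append(page)
    for user in users:
        key = user[0]
        if key in skewed_keys:
            # Copy the other side to every reducer that may hold the key.
            targets = range(num_reducers)
        else:
            targets = [stable_partition(key, num_reducers)]
        for target in targets:
            reducers[target][key][1].append(user)
    joined, rows_per_reducer = [], []
    for groups in reducers:
        rows = [p + u for ps, us in groups.values() for p in ps for u in us]
        rows_per_reducer.append(len(rows))
        joined.extend(rows)
    return sorted(joined), rows_per_reducer


pages = [("fred", "p1"), ("fred", "p2"), ("fred", "p3"), ("fred", "p4"), ("jane", "p5")]
users = [("fred", 20), ("jane", 24)]
joined, rows_per_reducer = skew_join(pages, users, skewed_keys={"fred"})
```

With a plain hash partitioner all four fred rows would land on one reducer. Here each reducer handles two of them, at the cost of sending the fred row from Users to both.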

  13. Merge Join
      Users = load 'users' as (name, age);
      Pages = load 'pages' as (user, url);
      Jnd = join Pages by user, Users by name using 'merge';
      [Diagram: Pages and Users are both sorted (aaron ... zach). Map 1 reads Pages aaron…amr together with Users starting at aaron; Map 2 reads Pages amy…barb together with Users starting at amy.]
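When both inputs are already sorted on the join key, as on slide 13, each map task can join by walking the two inputs in step. The function below is a generic sort-merge join sketch with hypothetical names and data. It does not show how Pig locates the right starting position in the second input.

```python
def merge_join(left, right):
    """Join two lists of tuples that are both sorted by their first field."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        left_key, right_key = left[i][0], right[j][0]
        if left_key < right_key:
            i += 1
        elif left_key > right_key:
            j += 1
        else:
            # Find the run of rows on the right that share this key.
            j_end = j
            while j_end < len(right) and right[j_end][0] == left_key:
                j_end += 1
            # Pair every left row of the run with every right row of the run.
            while i < len(left) and left[i][0] == left_key:
                out.extend(left[i] + r for r in right[j:j_end])
                i += 1
            j = j_end
    return out


pages = [("aaron", "a.com"), ("aaron", "b.com"), ("amy", "c.com"), ("zach", "d.com")]
users = [("aaron", 30), ("barb", 41), ("zach", 19)]
merged = merge_join(pages, users)
```

No shuffle or reduce phase is needed, and memory use is bounded by the longest run of equal keys on one side.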

  14. Replicated Join
      Users = load 'users' as (name, age);
      Pages = load 'pages' as (user, url);
      Jnd = join Pages by user, Users by name using 'replicated';
      [Diagram: Map 1 reads Pages aaron…amr and holds a full copy of Users (aaron ... zach); Map 2 reads Pages amy…barb and also holds a full copy of Users (aaron ... zach).]
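The replicated join on slide 14 amounts to a map-side hash join: every map task loads the small input into memory and probes it while streaming its block of the large input. A minimal sketch, assuming Users fits in memory. `build_lookup`, `map_task`, and the sample blocks are hypothetical.

```python
from collections import defaultdict


def build_lookup(small_rows):
    # Index the small input by join key. Each map task holds its own copy.
    lookup = defaultdict(list)
    for row in small_rows:
        lookup[row[0]].append(row)
    return lookup


def map_task(pages_block, users_lookup):
    # Stream one block of the large input and probe the in-memory table.
    return [
        page + user
        for page in pages_block
        for user in users_lookup.get(page[0], [])
    ]


users = [("aaron", 30), ("zach", 19)]
pages_blocks = [
    [("aaron", "a.com"), ("amr", "b.com")],
    [("amy", "c.com"), ("zach", "d.com")],
]
users_lookup = build_lookup(users)
outputs = [map_task(block, users_lookup) for block in pages_blocks]
```

The job needs no reduce phase at all, which is the main saving. The trade-off is that the small input is read once per map task.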

  15. Group/cogroup optimizations
      • On sorted and 'collected' data
      • grp = group Users by name using 'collected';
      [Diagram: the input, labeled Pages, is sorted (aaron, aaron, barney, carol, ..., zach); Map 1 receives aaron, aaron, barney; Map 2 receives carol, ...]
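If all records for a key sit next to each other inside a single map input, as the diagram on slide 15 suggests, grouping can finish in the map task. A sketch using `itertools.groupby`. `collected_group` and the sample blocks are hypothetical, and the code assumes no key straddles two blocks.

```python
from itertools import groupby
from operator import itemgetter


def collected_group(sorted_block):
    # groupby only merges adjacent rows, so the block must already be
    # ordered (or at least clustered) by the grouping key.
    return [
        (key, list(rows))
        for key, rows in groupby(sorted_block, key=itemgetter(0))
    ]


block_1 = [("aaron", 30), ("aaron", 31), ("barney", 25)]
block_2 = [("carol", 40)]
groups_1 = collected_group(block_1)
groups_2 = collected_group(block_2)
counts_1 = [(key, len(rows)) for key, rows in groups_1]
```

Because each group is complete within one task, the shuffle and the reduce phase can be skipped.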

  16. Multi-store script
      A = load 'users' as (name, age, gender, city, state);
      B = filter A by name is not null;
      C1 = group B by age, gender;
      D1 = foreach C1 generate group, COUNT(B);
      store D1 into 'bydemo';
      C2 = group B by state;
      D2 = foreach C2 generate group, COUNT(B);
      store D2 into 'bystate';
      [Diagram: A: load -> B: filter, which feeds two branches: group -> eval udf -> store into 'bydemo', and group -> eval udf -> store into 'bystate'.]

  17. Multi-Store Map-Reduce Plan
      map: filter -> split -> local rearrange (one per branch)
      reduce: multiplex -> package -> foreach (one package and foreach per branch)
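The plan on slide 17 runs both group-and-count branches of the multi-store script in one map-reduce job. The sketch below imitates that shape in plain Python. Tagging each emitted key with a branch index is an assumption made for illustration, not a description of Pig's key encoding, and `map_phase`, `reduce_phase`, and the sample rows are hypothetical.

```python
from collections import defaultdict


def map_phase(rows):
    emitted = []
    for name, age, gender, city, state in rows:
        if name is None:                            # filter
            continue
        # split: the same record feeds both branches.
        emitted.append(((0, (age, gender)), 1))     # branch 0 -> 'bydemo'
        emitted.append(((1, state), 1))             # branch 1 -> 'bystate'
    return emitted


def reduce_phase(pairs):
    bydemo, bystate = defaultdict(int), defaultdict(int)
    for (branch, group), value in pairs:
        # Route each key back to its own branch, then count.
        target = bydemo if branch == 0 else bystate
        target[group] += value
    return dict(bydemo), dict(bystate)


rows = [
    ("fred", 20, "m", "sf", "CA"),
    ("jane", 20, "f", "la", "CA"),
    (None, 30, "m", "ny", "NY"),
    ("amy", 20, "f", "ny", "NY"),
]
bydemo, bystate = reduce_phase(map_phase(rows))
```

The input is loaded and filtered once instead of once per output, which is the point of sharing the job.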

  18. Memory Management
      Use disk if large objects don’t fit into memory
      • JVM limit > phy mem: very poor performance
      • Spill on memory threshold notification from JVM: unreliable
      • Pre-set limit for large bags. Custom spill logic for different bags, e.g. distinct bag.
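The "pre-set limit for large bags" approach from slide 18 can be sketched as a collection that writes its contents to a temporary file once it holds a fixed number of items. This is an illustration only: `SpillableBag` and `max_in_memory` are hypothetical, the limit counts items rather than bytes, and `pickle` stands in for whatever serialization a real job would use.

```python
import pickle
import tempfile


class SpillableBag:
    """A bag that spills to disk after a fixed number of in-memory items."""

    def __init__(self, max_in_memory):
        self.max_in_memory = max_in_memory
        self.memory = []
        self.spill_files = []   # list of (file object, item count)

    def add(self, item):
        self.memory.append(item)
        if len(self.memory) >= self.max_in_memory:
            self._spill()

    def _spill(self):
        # Write the in-memory items to a new temporary file and free the list.
        spill = tempfile.TemporaryFile()
        for item in self.memory:
            pickle.dump(item, spill)
        self.spill_files.append((spill, len(self.memory)))
        self.memory = []

    def __iter__(self):
        # Replay spilled items first, then whatever is still in memory.
        for spill, item_count in self.spill_files:
            spill.seek(0)
            for _ in range(item_count):
                yield pickle.load(spill)
        yield from self.memory

    def close(self):
        for spill, _ in self.spill_files:
            spill.close()
        self.spill_files = []


bag = SpillableBag(max_in_memory=3)
for value in range(8):
    bag.add(value)
contents = list(bag)
spill_count = len(bag.spill_files)
in_memory_count = len(bag.memory)
bag.close()
```

A fixed limit makes spilling predictable, unlike a threshold notification that may arrive too late. A distinct bag would need different spill logic, for example sorting and deduplicating each spill.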

  19. Other optimizations
      • Aggressive use of combiner, secondary sort
      • Lazy deserialization in loaders
      • Better serialization format
      • Faster regex lib, compiled pattern
      • Compression between MR jobs
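As one illustration of the "lazy deserialization" bullet on slide 19, the sketch below converts a field from text to its typed value only when the field is first read. The slide does not describe how Pig's loaders do this, so `LazyTuple`, its `schema` argument, and the `conversions` counter are all hypothetical.

```python
class LazyTuple:
    """Holds raw text fields and converts each one only on first access."""

    def __init__(self, raw_line, schema, delimiter="\t"):
        self._raw_fields = raw_line.split(delimiter)
        self._schema = schema      # one converter callable per field
        self._converted = {}
        self.conversions = 0       # counts how many conversions actually ran

    def get(self, index):
        if index not in self._converted:
            converter = self._schema[index]
            self._converted[index] = converter(self._raw_fields[index])
            self.conversions += 1
        return self._converted[index]


schema = [str, int, float]
record = LazyTuple("fred\t20\t1.5", schema)
age = record.get(1)
age_again = record.get(1)
conversions_after_reads = record.conversions
```

A job that projects one column out of many then pays for one conversion per record instead of one per field. The "compiled pattern" bullet follows the same spirit: build the regular expression once, outside the per-record loop.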

  20. Future optimization work
      • Improve memory management
      • Join + group in single MR, if same keys used
      • Even better skew handling
      • Adaptive optimizations
      • Automated hadoop tuning
      • …

  21. Pig - fast and flexible
      More flexibility in 0.8, 0.9
      • UDFs in scripting languages (python)
      • MR job as relation
      • Relation as scalar
      • Turing complete pig (0.9)
      (Pic courtesy http://www.flickr.com/photos/shutterbc/471935204/)

  22. Further reading
      • Docs - http://pig.apache.org/docs/r0.7.0/
      • Papers and talks - http://wiki.apache.org/pig/PigTalksPapers
      • Training videos on vimeo.com (search 'hadoop pig')
