
Presentation Transcript


  1. The Pig Latin Dataflow Language • A Brief Overview • James Jolly • University of Wisconsin-Madison • jolly@cs.wisc.edu

  2. What is Pig Latin? • set-oriented data transformation language • primitives filter, combine, split, and order data • users describe transformations in steps • steps bundled into queries • each set transformation is stateless • flexible data model • nested bags of tuples • semi-structured datatypes • extensible • supports user-defined functions 2
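
To make the "extensible / supports user-defined functions" point concrete, here is a minimal sketch of registering and invoking a UDF. The jar name and the com.example.pig.ToLowerCase function are hypothetical illustrations, not artifacts from this talk:

     -- hypothetical names: my_udfs.jar and com.example.pig.ToLowerCase are illustrative
     REGISTER my_udfs.jar;
     words = LOAD 'words.txt' USING PigStorage(',')
             AS (website, word, freq, date);
     -- the UDF is called inline, just like a built-in function
     normalized = FOREACH words GENERATE website, com.example.pig.ToLowerCase(word), freq, date;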

  3. How is it used in practice? • useful for computations across large, distributed datasets • abstracts away details of execution framework • users can change order of steps to improve performance • often used in tandem with Hadoop and HDFS • transformations converted to MapReduce dataflows • HDFS tracks where data is stored • operations scheduled nearby their data 3
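
As an illustration of "users can change order of steps to improve performance", here is a minimal sketch (file and field names are illustrative) in which a filter is written before a join so that fewer tuples flow into the expensive, shuffle-heavy step:

     -- filter first, so the join sees fewer tuples; names are illustrative
     visits = LOAD 'visits.txt' USING PigStorage(',') AS (website, user);
     freqs  = LOAD 'words.txt'  USING PigStorage(',') AS (website, word, freq, date);
     recent = FILTER freqs BY date > 20081001;
     joined = JOIN visits BY website, recent BY website;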

  4. An example...
     Given two datasets:
     • a list of words and their frequency of appearance on webpages
     • a list of users and the webpages they visit
     Let’s find words users might be interested in lately.

  5. Dataset: words and their frequency of appearance...
     website          word      frequency  date
     news.bbc.co.uk   obama     0.010      20081005
     abcnews.go.com   scheme    0.025      20081010
     abcnews.go.com   bombing   0.021      20081006
     www.foxnews.com  bush      0.001      20081006
     www.cnn.com      mccain    0.031      20081017
     www.cnn.com      obama     0.001      20081002
     www.reuters.com  bush      0.012      20080921
     abcnews.go.com   congress  0.002      20080927
     www.reuters.com  bush      0.012      20080921
     www.foxnews.com  bush      0.001      20081006
     www.latimes.com  abortion  0.001      20081015
     www.latimes.com  attack    0.010      20081015
     www.reuters.com  obama     0.005      20080917
     www.foxnews.com  economy   0.038      20081006

  6. Dataset: webpages users visit...
     website          user
     www.reuters.com  bill
     news.bbc.co.uk   mike
     www.cnn.com      mike
     www.foxnews.com  bill
     www.reuters.com  drew
     www.latimes.com  james
     abcnews.go.com   james

  7. Loading word frequency data...
     freqs = LOAD '/home/jolly/TestData/NewsWords.txt' USING PigStorage(',')
             AS (website_indexed, word, freq, date);
     (news.bbc.co.uk, obama, 0.010, 20081005)
     (abcnews.go.com, scheme, 0.025, 20081010)
     (abcnews.go.com, bombing, 0.021, 20081006)
     (www.foxnews.com, bush, 0.001, 20081006)
     (www.cnn.com, mccain, 0.031, 20081017)
     (www.cnn.com, obama, 0.001, 20081002)
     (www.reuters.com, bush, 0.012, 20080921)
     (abcnews.go.com, congress, 0.002, 20080927)
     (www.reuters.com, bush, 0.012, 20080921)
     (www.foxnews.com, bush, 0.001, 20081006)
     (www.latimes.com, abortion, 0.001, 20081015)
     (www.latimes.com, attack, 0.010, 20081015)
     (www.reuters.com, obama, 0.005, 20080917)
     (www.foxnews.com, economy, 0.038, 20081006)
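
The AS clause above declares only field names. Depending on the Pig version, types can also be declared in the schema; a minimal variant (not in the original slides) would be:

     -- same LOAD, with explicit field types declared in the schema
     freqs = LOAD '/home/jolly/TestData/NewsWords.txt' USING PigStorage(',')
             AS (website_indexed:chararray, word:chararray, freq:double, date:int);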

  8. Hmm, we have some repeats...
     (news.bbc.co.uk, obama, 0.010, 20081005)
     (abcnews.go.com, scheme, 0.025, 20081010)
     (abcnews.go.com, bombing, 0.021, 20081006)
     (www.foxnews.com, bush, 0.001, 20081006)
     (www.cnn.com, mccain, 0.031, 20081017)
     (www.cnn.com, obama, 0.001, 20081002)
     (www.reuters.com, bush, 0.012, 20080921)
     (abcnews.go.com, congress, 0.002, 20080927)
     (www.reuters.com, bush, 0.012, 20080921)
     (www.foxnews.com, bush, 0.001, 20081006)
     (www.latimes.com, abortion, 0.001, 20081015)
     (www.latimes.com, attack, 0.010, 20081015)
     (www.reuters.com, obama, 0.005, 20080917)
     (www.foxnews.com, economy, 0.038, 20081006)

  9. Duplicate data no more!
     distinct_freqs = DISTINCT freqs;
     (www.cnn.com, obama, 0.001, 20081002)
     (www.cnn.com, mccain, 0.031, 20081017)
     (abcnews.go.com, scheme, 0.025, 20081010)
     (abcnews.go.com, bombing, 0.021, 20081006)
     (abcnews.go.com, congress, 0.002, 20080927)
     (news.bbc.co.uk, obama, 0.010, 20081005)
     (www.foxnews.com, bush, 0.001, 20081006)
     (www.foxnews.com, economy, 0.038, 20081006)
     (www.latimes.com, attack, 0.010, 20081015)
     (www.latimes.com, abortion, 0.001, 20081015)
     (www.reuters.com, bush, 0.012, 20080921)
     (www.reuters.com, obama, 0.005, 20080917)

  10. Hmm, these tuples are old…
     (www.cnn.com, obama, 0.001, 20081002)
     (www.cnn.com, mccain, 0.031, 20081017)
     (abcnews.go.com, scheme, 0.025, 20081010)
     (abcnews.go.com, bombing, 0.021, 20081006)
     (abcnews.go.com, congress, 0.002, 20080927)
     (news.bbc.co.uk, obama, 0.010, 20081005)
     (www.foxnews.com, bush, 0.001, 20081006)
     (www.foxnews.com, economy, 0.038, 20081006)
     (www.latimes.com, attack, 0.010, 20081015)
     (www.latimes.com, abortion, 0.001, 20081015)
     (www.reuters.com, bush, 0.012, 20080921)
     (www.reuters.com, obama, 0.005, 20080917)

  11. ... and these tuples (highlighted in green on the original slide) are not very significant.
     (www.cnn.com, obama, 0.001, 20081002)
     (www.cnn.com, mccain, 0.031, 20081017)
     (abcnews.go.com, scheme, 0.025, 20081010)
     (abcnews.go.com, bombing, 0.021, 20081006)
     (abcnews.go.com, congress, 0.002, 20080927)
     (news.bbc.co.uk, obama, 0.010, 20081005)
     (www.foxnews.com, bush, 0.001, 20081006)
     (www.foxnews.com, economy, 0.038, 20081006)
     (www.latimes.com, attack, 0.010, 20081015)
     (www.latimes.com, abortion, 0.001, 20081015)
     (www.reuters.com, bush, 0.012, 20080921)
     (www.reuters.com, obama, 0.005, 20080917)

  12. Let’s filter them out.
     important_freqs = FILTER distinct_freqs BY date > 20081001 AND freq > 0.002;
     (www.cnn.com, mccain, 0.031, 20081017)
     (abcnews.go.com, scheme, 0.025, 20081010)
     (abcnews.go.com, bombing, 0.021, 20081006)
     (news.bbc.co.uk, obama, 0.010, 20081005)
     (www.foxnews.com, economy, 0.038, 20081006)
     (www.latimes.com, attack, 0.010, 20081015)

  13. Hmm, we don’t need these anymore (the frequency and date fields)...
     (www.cnn.com, mccain, 0.031, 20081017)
     (abcnews.go.com, scheme, 0.025, 20081010)
     (abcnews.go.com, bombing, 0.021, 20081006)
     (news.bbc.co.uk, obama, 0.010, 20081005)
     (www.foxnews.com, economy, 0.038, 20081006)
     (www.latimes.com, attack, 0.010, 20081015)

  14. Let’s project them out.
     websites_to_words = FOREACH important_freqs GENERATE website_indexed, word;
     (www.cnn.com, mccain)
     (abcnews.go.com, scheme)
     (abcnews.go.com, bombing)
     (news.bbc.co.uk, obama)
     (www.foxnews.com, economy)
     (www.latimes.com, attack)

  15. Now we are ready to join our lists.
     Websites to Users:
     (news.bbc.co.uk, mike)
     (www.cnn.com, mike)
     (www.foxnews.com, bill)
     (www.reuters.com, drew)
     (www.latimes.com, james)
     (abcnews.go.com, james)
     Websites to Words:
     (www.cnn.com, mccain)
     (abcnews.go.com, scheme)
     (abcnews.go.com, bombing)
     (news.bbc.co.uk, obama)
     (www.foxnews.com, economy)
     (www.latimes.com, attack)

  16. Joining on website: finding words interesting to users...
     users_to_words_equijoin = JOIN websites_to_users BY website_visited,
                                    websites_to_words BY website_indexed;
     users_to_words = FOREACH users_to_words_equijoin GENERATE user, word;
     (mike, mccain)
     (james, scheme)
     (james, bombing)
     (mike, obama)
     (bill, economy)
     (james, attack)
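
For reference (this intermediate result is not shown on the slide), each tuple produced by the equijoin carries all four fields of its two inputs, roughly (website_visited, user, website_indexed, word), which is why the second statement projects down to just user and word. Pig's DESCRIBE operator is a quick way to confirm such intermediate schemas:

     -- each joined tuple has the form (website_visited, user, website_indexed, word),
     -- e.g. something like (www.cnn.com, mike, www.cnn.com, mccain)
     DESCRIBE users_to_words_equijoin;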

  17. Let’s group our results.
     interests = GROUP users_to_words BY user;
     (bill, {(bill, economy)})
     (mike, {(mike, mccain), (mike, obama)})
     (james, {(james, scheme), (james, bombing), (james, attack)})
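
Each group value above is a nested bag of tuples, so a natural follow-on step (not in the original slides) is to aggregate over those bags, for example counting interesting words per user:

     -- COUNT runs over the bag of tuples collected for each user
     interest_counts = FOREACH interests GENERATE group AS user, COUNT(users_to_words);

Given the groups shown above, this should yield one tuple per user with counts of 1 for bill, 2 for mike, and 3 for james.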

  18. How does it work? • logic factored into MapReduce jobs • mapper processes run on machines with input tuples • input tuples processed using MAP( ) function, producing intermediate tuples • intermediate tuples grouped together, transferred to reducer nodes • reducer processes consume intermediate tuples with REDUCE( ) function

  19. Translating Pig Latin to MapReduce...
     These statements can be executed using a single MapReduce job:
     transformed_by_map = FOREACH input_tuple GENERATE MAP(*);
     intermediate_tuple_partition = GROUP transformed_by_map BY input_tuple_key;
     result_tuples = FOREACH intermediate_tuple_partition GENERATE REDUCE(*);

  20. Example message traffic... (diagram not included in the transcript)

  21. Why Pig Latin? Why not a C library? We could just supply MAP( ) and REDUCE( ) to a C library... Pig Latin allows you to: • describe long tasks • in a friendly scripting language • use many built-in datatypes • support for semi-structured data • use many built-in functions • filters, projections, joins, unions, splits, etc. • tends to make user-defined functions simpler 21
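
Of the built-ins listed above, splits and unions are the only ones not demonstrated elsewhere in this deck. A minimal sketch, reusing the earlier important_freqs relation with an illustrative date threshold:

     -- split one relation into two by a predicate, then recombine them
     SPLIT important_freqs INTO recent_freqs IF date > 20081010,
                                older_freqs  IF date <= 20081010;
     all_freqs = UNION recent_freqs, older_freqs;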

  22. Why Pig Latin? Why not SQL? Pig Latin: • is imperative • lets users manually tune query execution plan • doesn’t need a schema • can easily read, write, and represent semi-structured data 22
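
To illustrate "doesn’t need a schema": Pig Latin can load data without declaring any field names and refer to fields by position. A minimal sketch, with an illustrative file name and field positions:

     raw       = LOAD 'clicks.txt' USING PigStorage(',');   -- no AS clause, no schema
     high_freq = FILTER raw BY (double)$2 > 0.002;          -- fields referenced by position
     pairs     = FOREACH high_freq GENERATE $0, $1;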

  23. Pig Latin really describes a generic dataflow.
     inputs = LOAD 'input.txt';
     results = FILTER inputs BY IsBoring(important_attribute);
     STORE results INTO 'results.txt';

  24. Summary Pig Latin programs: • typically operate on large volumes of unstructured data • describe a dataflow between primitive operations • many RDBMS-like operations built into the language • custom operations can be provided by the user • user specifies order of operations • dataflows can be executed using MapReduce paradigm Thanks for listening! 24
