
Pig, Making Hadoop Easy




Presentation Transcript


  1. Pig, Making Hadoop Easy • Alan F. Gates, Yahoo!

  2. Who Am I? • Pig committer • Hadoop PMC Member • An architect on the Yahoo! grid team • Or, as one coworker put it, “the lipstick on the Pig”

  3. Who are you?

  4. Motivation By Example Suppose you have user data in one file, website data in another, and you need to find the top 5 most visited pages by users aged 18 to 25. Steps: • Load Users • Load Pages • Filter by age • Join on name • Group on url • Count clicks • Order by clicks • Take top 5

  5. In Map Reduce

  6. In Pig Latin
     Users = load 'users' as (name, age);
     Fltrd = filter Users by age >= 18 and age <= 25;
     Pages = load 'pages' as (user, url);
     Jnd = join Fltrd by name, Pages by user;
     Grpd = group Jnd by url;
     Smmd = foreach Grpd generate group, COUNT(Jnd) as clicks;
     Srtd = order Smmd by clicks desc;
     Top5 = limit Srtd 5;
     store Top5 into 'top5sites';
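
To make the data flow of the script above concrete, here is a rough plain-Python model of the same steps on a tiny in-memory data set. The sample rows and the function name `top_pages` are made up for illustration, and the sketch assumes user names are unique. It only shows what the script computes, not how Pig executes it on Hadoop.

```python
from collections import Counter

# Hypothetical sample data standing in for the 'users' and 'pages' files.
users = [("alice", 22), ("bob", 31), ("carol", 19), ("dave", 17)]
pages = [
    ("alice", "a.com"), ("alice", "b.com"), ("carol", "a.com"),
    ("bob", "a.com"), ("carol", "c.com"), ("dave", "b.com"),
    ("alice", "a.com"),
]

def top_pages(users, pages, lo=18, hi=25, n=5):
    # Fltrd = filter Users by age >= lo and age <= hi
    fltrd = {name for name, age in users if lo <= age <= hi}
    # Jnd = join ...; Grpd = group Jnd by url; Smmd = COUNT per url
    clicks = Counter(url for user, url in pages if user in fltrd)
    # Srtd = order by clicks desc (ties broken by url so output is stable)
    srtd = sorted(clicks.items(), key=lambda kv: (-kv[1], kv[0]))
    # Top5 = limit Srtd n
    return srtd[:n]

print(top_pages(users, pages))
```

Each Python statement lines up with one or two aliases in the Pig Latin script, which is roughly the point of the slide: the script reads as a sequence of named data-flow steps.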

  7. Performance [chart not reproduced; surviving labels: 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]

  8. Why not SQL? • Data Collection • Data Factory: Pig (Pipelines, Iterative Processing, Research) • Data Warehouse: Hive (BI Tools, Analysis)

  9. Pig Highlights • User defined functions (UDFs) can be written for column transformation (TOUPPER), or aggregation (SUM) • UDFs can be written to take advantage of the combiner • Four join implementations built in: hash, fragment-replicate, merge, skewed • Multi-query: Pig will combine certain types of operations together in a single pipeline to reduce the number of times data is scanned • Order by provides total ordering across reducers in a balanced way • Writing load and store functions is easy once an InputFormat and OutputFormat exist • Piggybank, a collection of user-contributed UDFs
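
As an illustration of one of the join strategies listed above, the following is a minimal sketch of the idea behind a fragment-replicate join: the small input is copied to every map task and held in memory, while the large input is processed fragment by fragment, so no reduce phase is needed. The function names and sample rows are hypothetical, and this is plain Python rather than Pig's actual implementation.

```python
from collections import defaultdict

def build_lookup(small, key_index=0):
    # Replicated side: index the small relation by its join key in memory.
    lookup = defaultdict(list)
    for row in small:
        lookup[row[key_index]].append(row)
    return lookup

def join_fragment(fragment, lookup, key_index=0):
    # Fragmented side: each map task streams its own fragment of the
    # large relation and probes the in-memory copy of the small one.
    for row in fragment:
        for match in lookup.get(row[key_index], []):
            yield row + match

# Hypothetical data: a small 'users' relation and two fragments of 'pages'.
small = [("alice", 22), ("carol", 19)]
fragments = [
    [("alice", "a.com"), ("bob", "a.com")],
    [("carol", "c.com"), ("alice", "b.com")],
]

lookup = build_lookup(small)
joined = [row for frag in fragments for row in join_fragment(frag, lookup)]
print(joined)
```

The trade-off in this sketch is memory for shuffle cost: it only works when the replicated input fits in memory on each task.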

  10. Who uses Pig for What? • 70% of production jobs at Yahoo (tens of thousands per day) • Also used by Twitter, LinkedIn, eBay, AOL, … • Used to • Process web logs • Build user behavior models • Process images • Build maps of the web • Do research on raw data sets

  11. Accessing Pig • Submit a script directly • Grunt, the Pig shell • PigServer Java class, a JDBC-like interface

  12. Components • Pig resides on the user machine • Job executes on the Hadoop cluster • No need to install anything extra on your Hadoop cluster.

  13. How It Works
      Pig Latin:
      A = LOAD 'myfile' AS (x, y, z);
      B = FILTER A BY x > 0;
      C = GROUP B BY x;
      D = FOREACH C GENERATE group, COUNT(B);
      STORE D INTO 'output';
      • pig.jar: • parses • checks • optimizes • plans execution • submits jar to Hadoop • monitors job progress
      • Execution Plan: • Map: Filter, Count • Combine/Reduce: Sum
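
The execution plan on this slide (Map: Filter, Count; Combine/Reduce: Sum) can be modeled roughly as follows. This is a plain-Python sketch under simplifying assumptions: the input splits, the function names, and the in-process "cluster" are all made up for illustration, and real Pig compiles the script into Hadoop MapReduce jobs rather than running anything like this.

```python
from collections import defaultdict

def map_phase(records):
    # Map: apply the filter (x > 0) and emit a count of 1 per surviving row.
    for x, y, z in records:
        if x > 0:
            yield x, 1

def combine(pairs):
    # Combine/Reduce: sum the counts per key. Addition is associative,
    # so the same function can serve as both combiner and reducer.
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return sorted(totals.items())

def run(splits):
    # Each input split gets its own map task followed by a local combine.
    partials = [combine(map_phase(split)) for split in splits]
    # The reduce step merges the partial sums from all map tasks.
    return combine(pair for part in partials for pair in part)

# Hypothetical input: two splits of (x, y, z) rows.
splits = [
    [(1, "a", "b"), (2, "c", "d"), (-1, "e", "f")],
    [(1, "g", "h"), (1, "i", "j"), (0, "k", "l")],
]
print(run(splits))
```

Splitting COUNT into a per-row count in the map and a sum in the combine/reduce step is what lets partial results be merged before they cross the network.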

  14. Demo • s3://hadoopday/pig_tutorial

  15. Upcoming Features • In 0.8 (plan to branch end of August, release this fall): • Runtime statistics collection • UDFs in scripting languages (e.g. Python) • Ability to specify a custom partitioner • Adding many string and math functions as Pig supported UDFs • Post 0.8 • Adding branches, loops, functions, and modules • Usability • Better error messages • Fix ILLUSTRATE • Improved integration with workflow systems

  16. Learn More • Read the online documentation: http://hadoop.apache.org/pig/ • Online tutorials • From Yahoo, http://developer.yahoo.com/hadoop/tutorial/ • From Cloudera, http://www.cloudera.com/hadoop-training • Using Pig on EC2: http://developer.amazonwebservices.com/connect/entry.jspa?externalID=2728 • A couple of Hadoop books that include chapters on Pig are available; search at your favorite bookstore • Join the mailing lists: • pig-user@hadoop.apache.org for user questions • pig-dev@hadoop.apache.org for developer issues • howldev@yahoogroups.com for Howl
