Pig: Making Hadoop Easy http://hadoop.apache.org/pig
What is Pig? • An engine that executes Pig Latin locally or on a Hadoop cluster. • Pig Latin, a high-level data processing language.
Pig is a Hadoop Subproject • Apache Incubator: October '07 - October '08 • Graduated into a Hadoop subproject • Main page: http://hadoop.apache.org/pig/
Why Pig? • Higher level languages: • Increase programmer productivity • Decrease duplication of effort • Open the system to more users • Pig insulates you against Hadoop complexity: • Hadoop version upgrades • JobConf configuration tuning • Job chains
An Example Problem Suppose you have user data in one file, website data in another, and you need to find the top 5 most visited pages by users aged 18-25. The job breaks down into these steps: Load Users, Load Pages, Filter by age, Join on name, Group on url, Count clicks, Order by clicks, Take top 5.
In Pig Latin
Users = load 'users' as (name, age);
Fltrd = filter Users by age >= 18 and age <= 25;
Pages = load 'pages' as (user, url);
Jnd = join Fltrd by name, Pages by user;
Grpd = group Jnd by url;
Smmd = foreach Grpd generate group, COUNT(Jnd) as clicks;
Srtd = order Smmd by clicks desc;
Top5 = limit Srtd 5;
store Top5 into 'top5sites';
Ease of Translation Notice how naturally the components of the job translate into Pig Latin: • Load Users / Load Pages: Users = load … / Pages = load … • Filter by age: Fltrd = filter … • Join on name: Jnd = join … • Group on url: Grpd = group … • Count clicks: Smmd = … COUNT() … • Order by clicks: Srtd = order … • Take top 5: Top5 = limit …
Comparison Against the same job written directly in Map Reduce: • 1/20 the lines of code • 1/16 the development time • Performance within 2x
Pig Compared to Map Reduce • Faster development time • Many standard data operations (project, filter, join) already included. • Pig manages all the details of Map Reduce jobs and data flow for you.
And, You Don't Lose Power • Easy to provide user code throughout. External binaries can be invoked. • Metadata is not required, but it is supported and used when available. • Pig does not impose a data model on you. • Fine grained control: one line equals one action. • Complex data types (a short sketch follows this list).
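As a sketch of what those complex data types look like in practice: the file name and fields below are made up for illustration, assuming the typed schema syntax introduced in Pig 0.2.
-- hypothetical data: each row holds a bag of click tuples and a map of properties
Sessions = load 'sessions' as (user: chararray,
                               clicks: bag{c: tuple(url: chararray, time: long)},
                               props: map[]);
-- aggregate over the nested bag and pull one value out of the map
PerUser = foreach Sessions generate user, COUNT(clicks) as numclicks, props#'browser' as browser;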
Example, User Code
-- use a custom loader
Logs = load 'apachelogfile' using CommonLogLoader() as (addr, logname, user, time, method, uri, p, bytes);
-- apply your own function
Cleaned = foreach Logs generate addr, canonicalize(uri) as url;
Grouped = group Cleaned by url;
-- run the result through a binary
Analyzed = stream Grouped through 'urlanalyzer.py';
store Analyzed into 'analyzedurls';
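For user code like the loader and eval function above to be found at run time, the jar containing it is normally registered first. A minimal sketch; the jar name and package names here are hypothetical, not from the talk:
-- make the custom loader and eval function visible to Pig
register myudfs.jar;
define CommonLogLoader org.example.CommonLogLoader();
define canonicalize org.example.Canonicalize();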
Example, Schema on the Fly
-- declare your types
Grades = load 'studentgrades' as (name: chararray, age: int, gpa: double);
Good = filter Grades by age > 18 and gpa > 3.0;
-- ordering will be by type
Sorted = order Good by gpa;
store Sorted into 'smartgrownups';
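The declared types above are optional. When nothing is declared, fields are referred to by position and cast where a comparison needs a type; a small sketch against the same 'studentgrades' file (the output name is made up):
-- no schema declared: $0 is name, $1 is age, $2 is gpa
Grades = load 'studentgrades';
Good = filter Grades by (int)$1 > 18 and (double)$2 > 3.0;
store Good into 'goodgrades';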
How it Works
A Pig Latin script is translated to a set of operators which are placed in one or more Map Reduce jobs and executed. For the script below, the filter on $1 > 0 runs in the map phase, COUNT(B) is computed partially in the combiner and finished in the reducer, and the filter on cnt > 5 runs in the reducer.
A = load 'myfile';
B = filter A by $1 > 0;
C = group B by $0;
D = foreach C generate group, COUNT(B) as cnt;
E = filter D by cnt > 5;
dump E;
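One way to see how Pig has split a script like this one into map, combine, and reduce stages is the explain command in the grunt shell; the exact output differs between Pig versions, so treat this as a sketch:
-- after entering the script above in the grunt shell
explain E;   -- shows the plan Pig will use to evaluate E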
Current Pig Status • 30% of all Hadoop jobs at Yahoo are now Pig jobs, 1000s per day. • Graduated from the Apache Incubator in October '08 and was accepted as a Hadoop sub-project. • In the process of releasing version 0.2.0 • type system • 2-10x speedup • 1.6x Hadoop latency • Improved user experience: • Improved documentation • Pig Tutorial • UDF repository - PiggyBank • Development environment (Eclipse plugin)
What Users Do with Pig • Inside Yahoo (based on user interviews) • Used for both production processes and ad hoc analysis • Production • Examples: search infrastructure, ad relevance • Attraction: fast development, extensibility via custom code, protection against Hadoop changes, debuggability • Research • Examples: user intent analysis • Attraction: easy to learn, compact readable code, fast iteration on trying new algorithms, easy for collaboration
What Users Do with Pig • Outside Yahoo (based on mail list responses) • Processing search engine query logs • "Pig programs are easier to maintain, and less error-prone than native java programs. It is an excellent piece of work." • Image recommendations • "I am using it as a rapid-prototyping language to test some algorithms on huge amounts of data." • Adsorption Algorithm (video recommendations) • Hofmann's PLSI implementation in Pig • "The E/M logic was implemented in pig in 30-35 lines of pig-latin statements. Took a lot less compared to what it took in implementing the algorithm in mapreduce java. Exactly that's the reason I wanted to try it out in Pig. It took ~ 3-4 days for me write it, starting from learning pig :)" • Inverted Index • "The Pig feature that makes it stand out is the easy native support for nested elements — meaning, a tuple can have other tuples nested inside it; they also support Maps and a few other constructs. The Sigmod 2008 paper presents the language and gives examples of how the system is used at Yahoo. Without further ado — a quick example of the kind of processing that would be awkward, if not impossible, to write in regular SQL, and long and tedious to express in Java (even using Hadoop)."
What Users Do with Pig • Common asks • Control structures or embedding • UDFs in scripting languages (Perl, Python) • More performance
Roadmap • Performance • Latency: goal of 10-20% overhead compared to Hadoop • Better scalability: memory usage, dealing with skew • Planned improvements • Multi-query support • Rule-based optimizer • Handling skew in joins • Pushing projections to the loader • More efficient serialization • Better memory utilization
Roadmap (cont.) • Functionality • UDFs in languages other than Java • Perl, C++ • New Parser with better error handling
How Do I Get a Pig of My Own? • Need an installation of Hadoop to run on, see http://hadoop.apache.org/core/ • Get the pig jar. You can get release 0.1.0 at http://hadoop.apache.org/pig/releases.html. I strongly recommend using the code from trunk. • Get a copy of the hadoop-site.xml file for your Hadoop cluster. • Run java -cp pig.jar:configdir org.apache.pig.Main, where configdir is the directory containing your hadoop-site.xml.
How Do I Make My Pig Work? • Starting Pig with no script puts you in the grunt shell, where you can type Pig Latin and HDFS navigation commands (a small sketch follows after this list). • Pig Latin can be put in a file that is then passed to Pig. • JDBC-like interface for Java usage. • PigPen, an Eclipse plugin that supports textual and graphical construction of scripts. Shows sample data flowing through the script to illustrate how your script will work.
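A minimal sketch of a grunt session, assuming a file named 'users' with name and age fields already exists in HDFS; the file and field names are hypothetical:
grunt> ls
grunt> cat users
grunt> Users = load 'users' as (name, age);
grunt> Teens = filter Users by age < 20;
grunt> dump Teens;
The first two lines are HDFS navigation commands; the dump statement triggers execution of the Pig Latin statements and prints the result.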