Putting Lipstick on Apache Pig
Big Data Gurus Meetup
August 14, 2013
Motivation
Data should be accessible, easy to discover, and easy to process for everyone.
Big Data Users at Netflix
Analysts | Engineers
• Desires: Rich Toolset, Self Service, Easy, Rich APIs
• A Single Platform / Data Architecture that Serves Both Groups
Netflix Data Warehouse - Storage
• S3 is the source of truth
  • Decouples storage from processing
  • Persistent data; multiple / transient Hadoop clusters
• Data sources
  • Event data from cloud services via Ursula/Honu
  • Dimension data from Cassandra via Aegisthus
• ~100 billion events processed / day
• Petabytes of data persisted and available to queries on S3
Netflix Data Platform - Processing
• Long-running clusters: SLA and ad hoc
• Supplemental nightly bonus clusters
  • For high-priority ETL jobs
• 2,000+ instances in aggregate across the clusters
Netflix Hadoop Platform as a Service
https://github.com/Netflix/genie
Netflix Data Platform – Primitive Service Layer
• Primitive, decoupled services
• Building blocks for more complicated tools/services/apps
• Serves 1000s of MapReduce jobs / day
• 100+ jobs concurrently
Pig and Hive at Netflix
• Hive
  • Ad hoc queries
  • Lightweight aggregation
• Pig
  • Complex dataflows / ETL
  • Data movement “glue” between complex operations
Sample Pig Script* (Word Count)

    input_lines = LOAD '/tmp/my-copy-of-all-pages-on-internet' AS (line:chararray);

    -- Extract words from each line and put them into a pig bag
    -- datatype, then flatten the bag to get one word on each row
    words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;

    -- filter out any words that are just white spaces
    filtered_words = FILTER words BY word MATCHES '\\w+';

    -- create a group for each word
    word_groups = GROUP filtered_words BY word;

    -- count the entries in each group
    word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;

    -- order the records by count
    ordered_word_count = ORDER word_count BY count DESC;
    STORE ordered_word_count INTO '/tmp/number-of-words-on-internet';

* http://en.wikipedia.org/wiki/Pig_(programming_tool)#Example
Pig…
• Data flows are easy & flexible to express in text
  • Facilitates code reuse via UDFs and macros
  • Allows logical grouping of operations vs grouping by order of execution
• But errors are easy to make and overlook
  • Scripts can quickly get complicated
• Visualization quickly draws attention to:
  • Common errors
  • Execution order / logical flow
  • Optimization opportunities
Lipstick
• Generates graphical representations of Pig data flows
• Compatible with Apache Pig 0.11+
• Has been used to monitor more than 25,000 Pig jobs at Netflix
Overall Job Progress
Overall Job Progress Logical Plan
[Screenshot callouts: Records Loaded; Logical Operator (map side); Map/Reduce Job; Logical Operator (reduce side); Intermediate Row Count]
Lipstick for Fast Development
• During development:
  • Keep track of data flow
  • Spot common errors
    • Omitted (hanging) operators
    • Data type issues
  • Easily estimate and optimize complexity
    • Number of MR jobs generated
    • Map-only vs full Map/Reduce jobs
    • Opportunities to rejigger logic to:
      • Combine multiple jobs into a single job
      • Manipulate execution order to achieve better parallelism (e.g. less blocking)
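To make the "omitted (hanging) operator" check concrete, here is a minimal Python sketch of the idea: an alias is dangling if no later statement reads it and it is never stored. The dictionary representation of the plan, the function name `find_dangling_aliases`, and the `debug_sample` alias are all assumptions made for illustration; this is not how Lipstick itself represents or inspects a plan.

```python
from typing import Dict, List, Set


def find_dangling_aliases(plan: Dict[str, List[str]], stored: Set[str]) -> List[str]:
    """Return aliases that nothing consumes and that are never stored.

    `plan` maps each alias to the aliases it reads from (hypothetical format).
    """
    consumed = {src for inputs in plan.values() for src in inputs}
    return sorted(a for a in plan if a not in consumed and a not in stored)


# Hypothetical plan modeled on the word-count script, plus one forgotten alias.
plan = {
    "input_lines": [],
    "words": ["input_lines"],
    "filtered_words": ["words"],
    "word_groups": ["filtered_words"],
    "word_count": ["word_groups"],
    "ordered_word_count": ["word_count"],
    "debug_sample": ["words"],  # defined but never used or stored
}

dangling = find_dangling_aliases(plan, stored={"ordered_word_count"})
print(dangling)  # ['debug_sample']
```

On a graph rendering, the same condition shows up as a node with no outgoing edge that is not a STORE, which is why it is easy to spot visually.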
Lipstick for Job Monitoring
• During execution:
  • Graphically monitor execution status from a single console
  • Spot optimization opportunities
    • Map vs reduce side joins
    • Data skew
    • Better parallelism settings
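One way to think about spotting data skew from row counts is to compare each reducer's input against the typical reducer. The sketch below is an illustrative heuristic only: the function name `skewed_reducers`, the reducer ids, the counts, and the 5x-median threshold are assumptions, not something Lipstick is documented to compute.

```python
from statistics import median
from typing import Dict, List


def skewed_reducers(records_per_reducer: Dict[str, int], threshold: float = 5.0) -> List[str]:
    """Return reducer ids whose record count exceeds `threshold` times the median."""
    counts = list(records_per_reducer.values())
    if not counts:
        return []
    # Guard against a zero median so a single busy reducer is still flagged.
    limit = threshold * max(median(counts), 1)
    return sorted(r for r, n in records_per_reducer.items() if n > limit)


# Hypothetical per-reducer input record counts for one Map/Reduce job.
counts = {"r0": 1000, "r1": 1200, "r2": 900, "r3": 48000}
hot = skewed_reducers(counts)
print(hot)  # ['r3']
```

A reducer flagged this way usually points at a hot join or group key, which is where a map-side join or a different parallelism setting may help.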
Lipstick for Support
• Empowers users to support themselves
  • Better operational visibility
    • What is my script currently doing?
    • Why is my script slow?
  • Examine intermediate output of jobs
  • All execution information in one place
• Facilitates communication between infrastructure / support teams and end users
  • Lipstick link contains all information needed to provide support
Lipstick Architecture - Console
• Implements PigProgressNotificationListener interface
• Listens for:
  • New statements to be registered (unoptimized plan)
  • Script launched event (optimized, physical, M/R plan)
  • MR job completion/failure event
  • Heartbeat progress (during execution)
• Converts Pig plans and progress into Lipstick objects
• Communicates with Lipstick Server
Pig Compilation Plans
• Pig Script
• Unoptimized Logical Plan (~1:1 logical operator / line of Pig)
• Optimized Logical Plan
• Physical Plan
• MapReduce Plan (grouping of Physical Operators into map or reduce jobs)
Lipstick associates Logical Operators with MapReduce jobs by inferring relationships between Logical and Physical Operations.
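The association step can be pictured as a lookup from physical operators back to the aliases they came from. The Python sketch below assumes that each physical operator can be traced to one alias and that the MapReduce plan lists physical operators per job and phase. The data layout, the operator ids, and the function name `assign_operators_to_jobs` are hypothetical, and the real inference in Lipstick may differ.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical shapes: job id -> phase ("map"/"reduce") -> physical operator ids.
MRPlan = Dict[str, Dict[str, List[str]]]


def assign_operators_to_jobs(
    mr_plan: MRPlan, physical_to_alias: Dict[str, str]
) -> Dict[str, List[Tuple[str, str]]]:
    """Map each logical alias to the (job, phase) pairs that execute it."""
    result: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for job_id, phases in mr_plan.items():
        for phase, physical_ops in phases.items():
            for op in physical_ops:
                alias = physical_to_alias.get(op)
                if alias is not None and (job_id, phase) not in result[alias]:
                    result[alias].append((job_id, phase))
    return dict(result)


# Illustrative plan for the first job of the word-count script.
mr_plan = {
    "job_1": {
        "map": ["POLoad-1", "POForEach-2", "POFilter-3", "POLocalRearrange-4"],
        "reduce": ["POPackage-5", "POForEach-6"],
    }
}
physical_to_alias = {
    "POLoad-1": "input_lines",
    "POForEach-2": "words",
    "POFilter-3": "filtered_words",
    "POLocalRearrange-4": "word_groups",
    "POPackage-5": "word_groups",
    "POForEach-6": "word_count",
}

mapping = assign_operators_to_jobs(mr_plan, physical_to_alias)
print(mapping["word_groups"])  # [('job_1', 'map'), ('job_1', 'reduce')]
```

Note that a GROUP spans both phases in this example, which is why a logical operator can appear on the map side, the reduce side, or both.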
Lipstick Architecture – JS Client
• Displays and annotates graphs with status / progress
• Completely decoupled from Server
• Event-based design
• Periodically polls Server for job progress
• Usability is a key focus
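The combination of periodic polling and an event-based design can be sketched as a poller that notifies listeners only when the status changes. The real client is written in JavaScript; the Python below is only an analogy. The class name `JobPoller`, the status dictionary shape, and the state names `RUNNING`, `SUCCEEDED`, and `FAILED` are assumptions, and the fetch function is injected so the sketch makes no claim about the server's actual API.

```python
import time
from typing import Callable, Dict, List

Status = Dict[str, object]


class JobPoller:
    """Poll a status source and emit an event to listeners on every change."""

    def __init__(
        self,
        fetch_status: Callable[[], Status],
        interval_seconds: float = 5.0,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self._fetch_status = fetch_status
        self._interval = interval_seconds
        self._sleep = sleep
        self._listeners: List[Callable[[Status], None]] = []

    def on_update(self, listener: Callable[[Status], None]) -> None:
        self._listeners.append(listener)

    def run(self) -> Status:
        last = None
        while True:
            status = self._fetch_status()
            if status != last:  # only notify when something changed
                for listener in self._listeners:
                    listener(status)
                last = status
            if status.get("state") in ("SUCCEEDED", "FAILED"):  # hypothetical states
                return status
            self._sleep(self._interval)


# Usage with canned responses standing in for the server.
responses = iter([
    {"state": "RUNNING", "progress": 10},
    {"state": "RUNNING", "progress": 10},
    {"state": "SUCCEEDED", "progress": 100},
])
seen: List[Status] = []
poller = JobPoller(lambda: next(responses), sleep=lambda seconds: None)
poller.on_update(seen.append)
final = poller.run()
print(len(seen), final["state"])  # 2 SUCCEEDED
```

Keeping the rendering code behind `on_update` is what lets the display stay decoupled from how, or how often, progress is fetched.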
Solving Problems with Lipstick - Common Problem #1
My job has stalled.
[Screenshot callouts: Unoptimized/Optimized Logical Plan Toggle; Dangling Operator]
Common Problem #2
I didn’t get the data I was expecting.
Common Problem #3
I don’t understand why my job failed.
Successful Job (light blue background)
Failed Job (light red background)
Future of Lipstick
• Annotate common errors and inefficiencies on the graph
  • Skew / map side join opportunities / scalar issues
  • E.g. warnings / error dashboard
• Provide better details of runtime performance
  • Timings annotated on graph
  • Min / median / max mapper and reducer times
  • Map / reduce completion over time
• Search through execution history
  • Examine trends in runtime and data volumes
  • History of failure / success
• Search jobs for commonalities
  • Common datasets loaded / saved
  • Better grasp data lineage
  • Common uses of UDFs and macros
Honey? Lipstick on Hive
Wrapping up
• Lipstick is part of Netflix OSS.
• Clone it on GitHub at http://github.com/Netflix/Lipstick
• Check out the quickstart guide
  • https://github.com/Netflix/Lipstick/wiki/Getting-Started#1-quick-start
• Get started playing with Lipstick in under 5 minutes!
• We happily welcome your feedback and contributions!
Thank you!
• Jeff Magnusson: email@example.com | http://www.linkedin.com/in/jmagnuss | @jeffmagnusson
Jobs: http://jobs.netflix.com
Netflix OSS: http://netflix.github.io
Tech Blog: http://techblog.netflix.com/