Watching Pigs Fly with the Netflix Hadoop Toolkit

Hadoop Summit 2013, San Jose, CA

Presentation Transcript


  1. Watching Pigs Fly with the Netflix Hadoop Toolkit Hadoop Summit 2013 San Jose, CA

  2. Our Motivation Data should be accessible, easy to discover, and easy to process for everyone.

  3. Our Users Analysts Engineers

  4. Hadoop Platform as a Service

  5. Hadoop Platform as a Service S3

  6. Hadoop Platform as a Service Data Platform

  7. Data Platform as a Service: Ignite (A/B Test Analytics), Lipstick (Pig Workflow Visualization), Spock (Data Auditing), Sting (Adhoc Visualization), Looper (Backloading), Forklift (Data Movement), Genie (Hadoop PaaS), Franklin (Metadata API), Event Service (Orchestration); Hadoop, Other Processing, S3

  8. Let’s solve a problem using the data!

  9. Build a recommender.

  10. But, what makes good recommendations? Similarity Personalization

  11. COLORS!

  12. COLORS! Box art is colorful…

  13. We’re Sorry COLORS! Box art is colorful…

  14. Where can I find the data?

  15. Hadoop Platform as a Service S3

  16. Hadoop Platform as a Service RDS Redshift Cassandra Teradata S3

  17. Data Platform as a Service Franklin (Metadata API) RDS Redshift Cassandra Teradata S3

  18. Data Platform as a Service Franklin (Metadata API)

  19. Create a dataset for box art and color.

  20. Whether your dataset is large or small, being able to visualize it makes it easier to explain.

  21. Data Platform as a Service Sting (Adhoc Visualization) Franklin (Metadata API)

  22. Sting
      • Allows users to cache the results of a Genie job in memory
      • Sub-second response to OLAP-style operations (slicing, dicing, aggregations)
      • Adhoc or recurring schedule
      • Easy to use!
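As a rough illustration of the idea on the Sting slide (this is not Sting's actual implementation or API), the sketch below keeps a small result set in memory and answers slice and aggregate requests against it. The row layout, the column names, the sample values, and the helpers `slice_rows` and `aggregate` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical cached job result: one row per (title, hour) with a view count.
# The values are made up purely for illustration.
CACHED_ROWS = [
    {"title": "A", "hour": 20, "views": 120},
    {"title": "A", "hour": 21, "views": 180},
    {"title": "B", "hour": 20, "views": 60},
    {"title": "B", "hour": 21, "views": 40},
]


def slice_rows(rows, **filters):
    """Keep only rows whose columns equal every given filter value (a 'slice')."""
    return [r for r in rows if all(r[k] == v for k, v in filters.items())]


def aggregate(rows, group_by, measure):
    """Sum `measure` for each distinct value of the `group_by` column."""
    totals = defaultdict(int)
    for r in rows:
        totals[r[group_by]] += r[measure]
    return dict(totals)


if __name__ == "__main__":
    # Total views per title across all hours.
    print(aggregate(CACHED_ROWS, "title", "views"))
    # Views per title restricted to a single hour.
    print(aggregate(slice_rows(CACHED_ROWS, hour=20), "title", "views"))
```

Because the rows stay in memory, each new slice or aggregation is a scan over the cached list rather than a new cluster job, which is the property the slide attributes to Sting.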

  23. Hive Query Schema

  24. % Content Consumed / Hour

  25. Hemlock Grove Arrested Development House of Cards

  26. Similarity

  27. House of Cards Macbeth

  28. Toddlers & Tiaras Star Trek: Voyager

  29. Personalization

  30. Big Data # of subscribers X # of titles = ???,000,…,000 (big data)

  31. Netflix Apache Pig

  32. Data Platform as a Service Sting (Adhoc Visualization) Franklin (Metadata API) Lipstick

  33. Lipstick
      • Allows users to visualize their data flow
      • Allows users to see common errors
      • Allows users to easily monitor their jobs
      • Empowers users to support themselves
      • Facilitates communication between infrastructure team and users

  34. Lipstick

  35. Overall Job Progress

  36. Overall Job Progress Logical Plan

  37. Records Loaded, Logical Operator (map side), Map/Reduce Job, Logical Operator (reduce side), Intermediate Row Count

  38. Hadoop Counters

  39. Common Problem #1: My job has stalled.

  40. Unoptimized/Optimized Logical Plan Toggle, Dangling Operator

  41. Common Problem #2: I didn’t get the data I was expecting.

  42. Common Problem #3: I don’t understand why my job failed.
