An overview of Hulu’s metrics platform

Presentation Transcript

  1. An overview of Hulu’s metrics platform. Tristan Reid (tristan.reid@hulu.com), Prasan Samtani (prasan.samtani@hulu.com)

  2. What we do • Streaming video service • > 5.5 million subscribers • > 20 million unique visitors/month • > 1 billion ads/month

  3. It all begins with beacons. Diagram labels: Living room device (Roku, Xbox, etc.), Mobile device (Android, iPhone, etc.), Web (hulu.com), Beacon collection service

  4. What’s in a beacon: 80 2013-04-01 00:00:00 /v3/playback/start?bitrate=650&cdn=Akamai&channel=Anime&client=Explorer&computerguid=EA8FA1000232B8F6986C3E0BE55E9333&contentid=5003673 …
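A minimal sketch of how a beacon line like the one on slide 4 could be split into fields. The function name `parse_beacon`, the field names in the returned dictionary, and the assumed layout (a leading value, a date, a time, then the request URI) are illustrative assumptions; the slide does not say what the leading `80` means or how Hulu parses beacons.

```python
from urllib.parse import urlsplit, parse_qsl


def parse_beacon(line):
    """Split a raw beacon line into leading field, timestamp, type, and parameters."""
    # Assumed layout: <leading field> <date> <time> <request URI>
    lead, date, time, uri = line.split(" ", 3)
    parts = urlsplit(uri)
    path = parts.path
    # Strip the version prefix so the type reads like "playback/start".
    beacon_type = path[len("/v3/"):] if path.startswith("/v3/") else path
    return {
        "lead": lead,  # meaning not stated on the slide
        "timestamp": f"{date} {time}",
        "beacon_type": beacon_type,
        "params": dict(parse_qsl(parts.query)),
    }


sample = ("80 2013-04-01 00:00:00 /v3/playback/start?"
          "bitrate=650&cdn=Akamai&channel=Anime&contentid=5003673")
beacon = parse_beacon(sample)
print(beacon["beacon_type"], beacon["params"]["cdn"])  # prints: playback/start Akamai
```

All parameter values come back as strings; converting `bitrate` or `contentid` to integers would be a separate, schema-aware step.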

  5. Reporting platform (RP2) • Find metrics & dimensions • Design and execute reports

  6. The pipeline. Diagram labels: Devices, Beacon collection service, LogCollector/Flume, HDFS, Monitoring (metstat), MapReduce jobs/JobScheduler, Developers, Hive, Reporting (RP2), Harpy – continuous aggregation, RDBMS, Business

  7. Log collection. Diagram labels: Devices, Load balancer, Log Collection machines #1 … #11, HDFS. Files are bucketed by beacon type and partitioned by hour.

  8. Directory hierarchy on HDFS
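The directory listing from slide 8 did not survive in the transcript. As a rough illustration of "bucketed by beacon type and partitioned by hour", the sketch below maps a beacon type and timestamp to a directory. The root `/beacons`, the function name `hourly_partition`, and the `type/YYYY/MM/DD/HH` layout are all hypothetical, not Hulu's actual hierarchy.

```python
from datetime import datetime
from pathlib import PurePosixPath


def hourly_partition(root, beacon_type, ts):
    """Return the directory for one beacon type and one hour (hypothetical layout)."""
    bucket = beacon_type.replace("/", "_")  # one bucket per beacon type
    return PurePosixPath(root) / bucket / ts.strftime("%Y/%m/%d/%H")


path = hourly_partition("/beacons", "playback/start", datetime(2013, 4, 1, 0, 30))
print(path)  # prints: /beacons/playback_start/2013/04/01/00
```

Partitioning by hour means a downstream job for a given hour only has to read one directory per beacon type.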

  9. MapReduce - going from beacons to basefacts

  10. If a program manipulates a large amount of data, it does so in a small number of ways - Alan Perlis

  11. The BeaconSpec compiler. Input: definitions of beacons and base-facts. Output: Java MapReduce code that can run on the cluster.

  12. What does our language look like? basefact playback_watched_uniques from playback/(position|end) { dimension harpyhour.id as hourid; dimension computerguid as computerguid; dimension userid as userid; required dimension video.id as video_id; required dimension contentPartner.id as content_partner_id; … dimension siteSessionId.chosen as site_session_id; dimension facebook.isfacebookconnected as is_facebook_connected; fact sum(watched.out) as watched; } FAQ: Why didn’t we just use Pig?
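One plausible reading of the basefact definition on slide 12 is: take beacons whose type matches `playback/(position|end)`, group them by the listed dimensions, and sum the `watched` fact. The in-memory sketch below follows that reading. It is not the generated MapReduce code, it covers only a subset of the dimensions, it assumes records are already flattened into dictionaries, and it assumes that a missing `required` dimension causes the record to be dropped, which the slide does not state.

```python
import re
from collections import defaultdict

BEACON_PATTERN = re.compile(r"playback/(position|end)")
DIMENSIONS = ["hourid", "computerguid", "userid", "video_id", "content_partner_id"]
REQUIRED = ["video_id", "content_partner_id"]


def aggregate(beacons):
    """Group matching beacons by their dimension tuple and sum the `watched` fact."""
    totals = defaultdict(int)
    for b in beacons:
        if not BEACON_PATTERN.fullmatch(b["type"]):
            continue  # the basefact only reads playback/position and playback/end
        if any(b.get(d) is None for d in REQUIRED):
            continue  # assumption: a missing required dimension drops the record
        key = tuple(b.get(d) for d in DIMENSIONS)
        totals[key] += b.get("watched", 0)
    return dict(totals)


base = {"hourid": 1, "computerguid": "A", "userid": 7,
        "video_id": 10, "content_partner_id": 3}
beacons = [
    {**base, "type": "playback/position", "watched": 30},
    {**base, "type": "playback/end", "watched": 12},
    {**base, "type": "playback/start", "watched": 99},                   # wrong beacon type
    {**base, "type": "playback/position", "video_id": None, "watched": 5},  # required dimension missing
]
result = aggregate(beacons)
print(result)  # prints: {(1, 'A', 7, 10, 3): 42}
```

In a MapReduce setting the dimension tuple would be the map output key and the summation would happen in the reducer.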

  13. The superior [program] cultivates itself so as to give rest to [programmers] - Confucius, the Way of the Superior Man

  14. Scheduling jobs. Diagram labels: Outside world, JobScheduler Interface, JobScheduler, JobMonitors, MapReduce jobs, Logmanager, databases. Caption: checks databases for jobs that are ready to run and whether dependencies are met.
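The readiness check described on slide 14 can be sketched as a simple dependency test: a job may run once everything it depends on has completed. The function `ready_jobs`, the dictionary representation, and the job names are hypothetical; the real JobScheduler reads this state from databases.

```python
def ready_jobs(jobs, completed):
    """Return names of jobs not yet run whose dependencies have all completed."""
    return sorted(
        name for name, deps in jobs.items()
        if name not in completed and all(d in completed for d in deps)
    )


# Hypothetical job graph: each job maps to the jobs it depends on.
jobs = {
    "collect_logs": [],
    "playback_basefacts": ["collect_logs"],
    "hourly_rollup": ["playback_basefacts"],
}
print(ready_jobs(jobs, completed={"collect_logs"}))  # prints: ['playback_basefacts']
```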

  15. JobScheduler technology • The actor model of concurrency • Communication through async messaging • Completely encapsulated state

  16. Message passing Actor creation Central idea: Treat local objects as if they are distributed, as opposed to treating distributed objects as if they are local
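To make the three properties from slide 15 concrete (actors, asynchronous messaging, encapsulated state), here is a toy actor built on the Python standard library. It illustrates the pattern only; the class `CounterActor` and its message names are made up, and the slides do not say which actor library the JobScheduler uses.

```python
import queue
import threading


class CounterActor:
    """An actor: private state, a mailbox, and one thread that drains it."""

    def __init__(self):
        self._count = 0                    # state touched only by the actor thread
        self._mailbox = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def tell(self, message, reply_to=None):
        """Asynchronous send: enqueue the message and return immediately."""
        self._mailbox.put((message, reply_to))

    def _run(self):
        while True:
            message, reply_to = self._mailbox.get()
            if message == "increment":
                self._count += 1
            elif message == "get":
                reply_to.put(self._count)  # replies are messages too
            elif message == "stop":
                break


actor = CounterActor()
for _ in range(3):
    actor.tell("increment")
reply = queue.Queue()
actor.tell("get", reply_to=reply)
count = reply.get(timeout=5)
actor.tell("stop")
print(count)  # prints: 3
```

Because callers can only reach the counter through `tell`, the same calling code would work if the mailbox were a network queue, which is the "treat local objects as if they are distributed" idea.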

  17. Fault-tolerance – let it crash!
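"Let it crash" usually means a worker does not try to recover from unexpected errors itself; a supervisor notices the failure and starts a fresh worker with clean state. The single-threaded sketch below shows that idea in miniature. The names `supervise` and `Summer`, the restart limit, and the choice to drop the failing message are assumptions for illustration, not the JobScheduler's actual policy.

```python
def supervise(make_worker, messages, max_restarts=3):
    """Feed messages to a worker; on any exception, replace it with a fresh one."""
    worker = make_worker()
    restarts = 0
    for msg in messages:
        try:
            worker.handle(msg)
        except Exception:
            restarts += 1
            if restarts > max_restarts:
                raise                      # give up and escalate
            worker = make_worker()         # fresh state; the failed message is dropped
    return worker, restarts


class Summer:
    """Worker with no defensive code: bad input simply raises."""

    def __init__(self):
        self.total = 0

    def handle(self, msg):
        self.total += int(msg)             # raises ValueError on non-numeric input


worker, restarts = supervise(Summer, ["1", "2", "oops", "4"])
print(worker.total, restarts)  # prints: 4 1
```

The total is 4 rather than 7 because the restarted worker begins from clean state; whether to restore state after a restart is a policy decision for the supervisor.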

  18. Harpy – continuous aggregations Harpy Metadata Queue Processor Hive DataSync Publishing HDFS NFS Holding Sweeper Agg Scheduler HoldingDB Output DBs

  19. RP2 • Reporting Portal for pulling Metrics + Dimensions • Quick ‘Demo’

  20. Let’s reexamine the pipeline. Diagram labels: Devices, Beacon collection service, LogCollector/Flume, HDFS, Monitoring (metstat), MapReduce jobs/JobScheduler, Developers, Hive, Reporting (RP2), Harpy – continuous aggregation, RDBMS, Business

  21. Metstat • Python Django App • Tasks on Celery + RabbitMQ • JQuery • Tracks status, status changes and statistics • Gets data directly from various sources (databases, HDFS)

  22. FAQ: Why didn’t we just use Pig? • Dataflow language – runs on Hadoop • Pig philosophy • (Taken from the Apache website) • Pigs eat anything • Pigs live anywhere • Pigs are domestic animals • Pigs fly Beaconspec

  23. REGISTER ./tutorial.jar; raw = LOAD 'excite.log' USING PigStorage('\t') AS (user, time, query); clean1 = FILTER raw BY org.apache.pig.tutorial.NonURLDetector(query); clean2 = FOREACH clean1 GENERATE user, time, org.apache.pig.tutorial.ToLower(query) as query; Beware of the Turing tar-pit where everything is possible but nothing of interest is easy - Alan Perlis Beaconspec

  24. FAQ: What is open sourced? • Slickint – database interface generation for Scala • github.com/zenbowman/slickint • Local filesystem caching for Hadoop • github.com/ZenBowman/luna