Presentation Transcript

  1. How Conviva used Spark to Speed Up Video Analytics by 25x Dilip Antony Joseph (@DilipAntony)

  2. Conviva monitors and optimizes online video for premium content providers We see 10s of millions of streams every day

  3. Conviva data processing architecture • [Architecture diagram] Video Player data feeds three paths: Live data processing → Monitoring Dashboard; Hadoop for historical data → Reports; Spark → Ad-hoc analysis

  4. Group By queries dominate our workload • SELECT videoName, COUNT(1) FROM summaries WHERE date='2011_12_12' AND customer='XYZ' GROUP BY videoName; • 10s of metrics, 10s of group bys • Hive scans data again and again from HDFS → Slow • Conviva GeoReport took ~24 hours using Hive

  5. Group By queries can be easily written in Spark
    val sessions = sparkContext.sequenceFile[SessionSummary, NullWritable](
        pathToSessionSummaryOnHdfs,
        classOf[SessionSummary], classOf[NullWritable])
      .flatMap { case (summary, _) => summary.fieldsOfInterest }
    val cachedSessions = sessions.filter(
        whereConditionToFilterSessionsForTheDesiredDay).cache
    val mapFn: SessionSummary => (String, Long) = { s => (s.videoName, 1L) }
    val reduceFn: (Long, Long) => Long = { (a, b) => a + b }
    val results = cachedSessions.map(mapFn).reduceByKey(reduceFn).collectAsMap
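The same pipeline can be exercised without a cluster using plain Scala collections. This is a minimal sketch, not Conviva's code: SessionSummary here is a hypothetical case class standing in for their real session-summary record, and groupBy plus a per-key sum mimics what reduceByKey computes across partitions.

```scala
// Minimal local sketch of the Spark group-by pipeline above.
// SessionSummary is a hypothetical stand-in for the real summary record.
case class SessionSummary(date: String, customer: String, videoName: String)

object GroupByDemo {
  val sessions = Seq(
    SessionSummary("2011_12_12", "XYZ", "videoA"),
    SessionSummary("2011_12_12", "XYZ", "videoA"),
    SessionSummary("2011_12_12", "XYZ", "videoB"),
    SessionSummary("2011_12_13", "XYZ", "videoA") // outside the desired day
  )

  // filter ~ the WHERE clause; on Spark, .cache would pin these rows in RAM
  val cachedSessions =
    sessions.filter(s => s.date == "2011_12_12" && s.customer == "XYZ")

  // map + reduceByKey ~ GROUP BY videoName, COUNT(1)
  val results: Map[String, Long] =
    cachedSessions
      .map(s => (s.videoName, 1L))
      .groupBy(_._1)
      .map { case (video, ones) => (video, ones.map(_._2).sum) }
}
```

Here results equals Map("videoA" -> 2, "videoB" -> 1), the same answer the Hive query on slide 4 would give for that day and customer.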

  6. Spark is blazing fast! • Spark keeps sessions of interest in RAM • Repeated group by queries are very fast • Spark-based GeoReport runs in 45 minutes (compared to 24 hours with Hive)

  7. Spark queries require more code, but are not too hard to write • Writing queries in Scala – there is a learning curve • Type-safety offered by Scala is a great boon • Code completion via the Eclipse Scala plugin • Complex queries are easier to write in Scala than in Hive (no cascading IF()s as in Hive)
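To make the "cascading IF()s" point concrete, here is a hedged illustration (the metric and thresholds are invented for the example, not taken from the talk): bucketing a bitrate needs nested IF(cond, x, IF(...)) calls in HiveQL, while Scala expresses the same logic as a flat match with guards.

```scala
object BitrateBucket {
  // HiveQL equivalent, as cascading IFs:
  //   IF(bitrate < 500, 'low', IF(bitrate < 2000, 'medium', 'high'))
  // Thresholds below are illustrative only.
  def bucket(bitrateKbps: Int): String = bitrateKbps match {
    case b if b < 500  => "low"
    case b if b < 2000 => "medium"
    case _             => "high"
  }
}
```

Each additional bucket is one more case line in Scala, versus one more level of IF() nesting in HiveQL.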

  8. Challenges in using Spark • Learning Scala • Always on the bleeding edge – getting dependencies right • More tools required

  9. Spark @ Conviva today • Using Spark for about 1 year • 30% of our reports use Spark, rest use Hive • Analytics portal with canned Spark/Hive jobs • More projects in progress • Anomaly detection • Interactive console to debug video quality issues • Near real-time analysis and decision making using Spark • Blog Entry: http://www.conviva.com/blog/engineering/using-spark-and-hive-to-process-bigdata-at-conviva

  10. We are Hiring: jobs@conviva.com