1 / 12

Spark in the Hadoop Ecosystem

Spark in the Hadoop Ecosystem. Eric Baldeschwieler (a.k.a. Eric14) twitter: @jeric14. Who is Eric14. A Hadoop ecosystem cheerleader & Tech Advisor Previously CTO/CEO of Hortonworks VP Hadoop Engineering @ Yahoo!. Spark “on the radar”.

milt
Download Presentation

Spark in the Hadoop Ecosystem

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Spark in the Hadoop Ecosystem • Eric Baldeschwieler (a.k.a. Eric14) • twitter: @jeric14

  2. Who is Eric14 • A Hadoop ecosystem cheerleader & Tech Advisor • Previously • CTO/CEO of Hortonworks • VP Hadoop Engineering @ Yahoo!

  3. Spark “on the radar” • 2008 - Yahoo! Hadoop team collaboration w Berkeley Amp/Rad lab begins • 2009 - Spark example built for Nexus -> Mesos • 2011 - “Spark is 2 years ahead of anything at Google” - Conviva seeing good results w Spark • 2012 - Yahoo! working with Spark / Shark • Today - Many success stories - Early commercial support

  4. Spark Today

  5. Spark updates Hadoop • Hardware had advanced since Hadoop started: • Very large RAMs, Faster networks (10Gb+) • Bandwidth to disk not keeping up • MapReduce awkward for key big data workloads: • Low latency dispatch (E.G. quick queries) • Iterative algorithms (E.G. ML, Graph…) • Streaming data ingest

  6. Spark, “lingua franca?” • Support for many development techniques • SQL, Streaming, Graph & in memory, MapReduce • Write “UDFs” once and use in all contexts • Small, simple & elegant API  • Easy to learn and use; expressive & extensible • Retains advantages of MapReduce (fault tolerance…)

  7. Spark often better • Today you will hear many success stories from teams who have converted Hadoop based workloads to Spark and seen: • Huge speedups • Big cost savings • But there do exist cases were Hadoop is superior… • Proven to work at the largest scales • Mature & widely commercially supported • Much larger ecosystem of solutions and tools

  8. Spark complements Hadoop • Hadoop 2.x being widely deployed this year • Spark now just another kind of Yarn Job • Spark leverages Hadoop ecosystem • HDFS, HCatalog, Data Input/OutputFormats • Huge investment in data collection & tooling

  9. The Future

  10. Spark and Hadoop • Hadoop vs. BDAS is not the story! • Yes spark is useful separately from Hadoop and may flourish independently… • But with Yarn, Spark can be brought to the data in every Hadoop 2 cluster in the world… • Complimenting the investment already made by enterprise.

  11. Spark the “lingua franca” • Data scientists & Developers need an open standard for sharing their Algorithms & functions, an “R” for big data. • Spark best current candidate: • Open Source - Apache Foundation • Expressive (MR, iteration, Graphs, SQL, streaming) • Easily extended & embedded (DSLs, Java, Python…)

  12. “The future is already here. It’s just not very evenly distributed.” – William Gibson

More Related