Using Spark and Shark for Fast Cycle Analysis on Diverse Data

Using Spark and Shark for Fast Cycle Analysis on Diverse Data. Vaibhav Nivargi. 12.2.13. About ClearStory Data. Analysis in the New Data Landscape. New use cases seen in all industries.

Presentation Transcript


  1. Using Spark and Shark for Fast Cycle Analysis on Diverse Data. Vaibhav Nivargi. 12.2.13

  2. About ClearStory Data

  3. Analysis in the New Data Landscape. New use cases seen in all industries.
     • Live situational analysis requiring fast-cycle analysis across internal data and external data sources
     • Multi-source analysis, with data refreshed for new insights as data from the sources evolves
     • Large-scale analysis of structured and unstructured data combined into integrated insights

  4. Example: Interactive Multi-source Analysis. More data and more people change the analysis.
     Interactive analysis on diverse internal & external data
     Data Intelligence
     • News Coverage: Online, Print, Television
     • Donations: New Members, Donations
     • Facebook: Shares, Likes, Comments
     • Website Traffic: Traffic, Referrals, Content
     • Twitter: Followers, Tweets, Retweets
     • Corporate Sponsors: Corporate Engagement, New Inquiries

  5. Today’s Need is Speed, Scale & Ad Hoc Flexibility. With more sources, more data and more people.

  6. Why Spark and Shark?
     • RDDs
     • Low latency & scale
     • Iterative and interactive computation
     • Lineage and fault tolerance
     • Able to re-derive data
     • Expressive power of Scala and SQL
     • Operations beyond aggregations, joins, and statistical operators
     • Advanced: ML, data mining, segmentation, approximate queries, graphs …
     • Support for structured and semi-structured data
     • BDAS Stack & AMPLab
     • Tachyon, MLBase, BlinkDB, GraphX …
     • Community and adoption
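
The "lineage and fault tolerance" and "able to re-derive data" points can be illustrated with a toy model. The sketch below is plain Python, not Spark's API: `ToyDataset`, `lose_cache`, and the sample numbers are all hypothetical names and values made up for illustration. Each dataset records its parent and the transformation that produced it, so a lost materialized copy can be recomputed from the source.

```python
# Toy model of lineage-based recovery. This is NOT Spark's API;
# every name here is hypothetical and exists only for illustration.
from typing import Callable, List, Optional


class ToyDataset:
    """A dataset that remembers how it was derived (its lineage)."""

    def __init__(self,
                 source: Optional[List[int]] = None,
                 parent: Optional["ToyDataset"] = None,
                 transform: Optional[Callable[[List[int]], List[int]]] = None):
        self.source = source        # raw input, set only on the root dataset
        self.parent = parent        # dataset this one was derived from
        self.transform = transform  # function applied to the parent's rows
        self.cached: Optional[List[int]] = None  # materialized copy, may be lost

    def map(self, fn: Callable[[int], int]) -> "ToyDataset":
        return ToyDataset(parent=self,
                          transform=lambda rows: [fn(r) for r in rows])

    def filter(self, pred: Callable[[int], bool]) -> "ToyDataset":
        return ToyDataset(parent=self,
                          transform=lambda rows: [r for r in rows if pred(r)])

    def collect(self) -> List[int]:
        # Recompute from lineage whenever no materialized copy exists.
        if self.cached is None:
            if self.parent is None:
                self.cached = list(self.source or [])
            else:
                self.cached = self.transform(self.parent.collect())
        return self.cached

    def lose_cache(self) -> None:
        # Simulates losing the in-memory copy, e.g. after a node failure.
        self.cached = None


base = ToyDataset(source=[1, 2, 3, 4, 5, 6])
evens_squared = base.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

first = evens_squared.collect()
evens_squared.lose_cache()
second = evens_squared.collect()  # re-derived from the recorded lineage
print(first, second)
```

Nothing is checkpointed in this sketch; recovery relies only on the recorded chain of transformations, which is the idea the slide attributes to RDDs.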

  7. The ClearStory Solution. Diagram labels: Data Sources; ClearStory Platform; ClearStory Application; Harmonization; In-Memory Data Units; Data Inference & Profiling; Visualization; Collaboration

  8. Where do Spark & Shark fit? Diagram labels: User Application; ClearStory API; Harmonization Engine and Blended Data Processing; Spark Cluster + ClearStory IP; Data Access, Inference and Lineage; Data Source API; Files; Hadoop; RDBMS; Web; Premium; Public

  9. How we leverage Spark & Shark
     User intent captured and translated to custom API
     Harmonization-as-a-Service
     • Manages Spark and Shark query execution
     • Reads cached data from HDFS
     • RESTful
     • Merges datasets (RDDs) on the fly – on user request
     • Supports conversion of user actions to backend queries
     • Query optimizations
     Performance optimizations
     • Mixed-mode execution (sql2rdd & Spark native)
     • Caching
     • Pre-computation
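
The slide does not show how user actions become backend queries. As a rough sketch of that general idea, assuming a user action can be described as a dataset, grouping columns, and one aggregated measure, the translation might look like the following. `UserAction`, `to_sql`, and the `donations` table are hypothetical and are not ClearStory's API.

```python
# Hypothetical sketch: turning a UI action into a SQL string for a
# SQL-capable backend. None of these names come from the presentation.
from dataclasses import dataclass
from typing import List, Optional

ALLOWED_AGGREGATES = {"SUM", "AVG", "COUNT", "MIN", "MAX"}


@dataclass
class UserAction:
    dataset: str                  # table the user is exploring
    group_by: List[str]           # dimensions chosen in the UI
    measure: str                  # column to aggregate
    aggregate: str = "SUM"
    limit: Optional[int] = None


def _check_identifier(name: str) -> str:
    # Accept only letters, digits, and underscores so user input
    # cannot inject arbitrary SQL into the generated query.
    if not name.replace("_", "").isalnum():
        raise ValueError(f"invalid identifier: {name!r}")
    return name


def to_sql(action: UserAction) -> str:
    agg = action.aggregate.upper()
    if agg not in ALLOWED_AGGREGATES:
        raise ValueError(f"unsupported aggregate: {action.aggregate!r}")
    if not action.group_by:
        raise ValueError("at least one group_by column is required")
    dims = ", ".join(_check_identifier(c) for c in action.group_by)
    measure = _check_identifier(action.measure)
    table = _check_identifier(action.dataset)
    sql = (f"SELECT {dims}, {agg}({measure}) AS {agg.lower()}_{measure} "
           f"FROM {table} GROUP BY {dims}")
    if action.limit is not None:
        sql += f" LIMIT {int(action.limit)}"
    return sql


query = to_sql(UserAction("donations", ["source"], "amount", limit=10))
print(query)
```

Validating identifiers and whitelisting aggregates keeps the generated string safe to hand to a query engine; a real service would likely build a query plan rather than concatenate text.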

  10. How we leverage Spark & Shark
      • Query results returned to the application for scalable visualization and ClearStory-specific viz techniques
      • RDDs cached/un-cached and materialized at strategic points based on usage patterns and signals
      • Data updates automatically processed as source data changes
      • ClearStory’s own deployment, packaging, and integrated monitoring for operations at scale
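
Caching "based on usage patterns" can take many forms, and the slide does not say which signals are used. One minimal sketch, assuming the only signal is how often a result is requested, is shown below in plain Python. `UsageBasedCache`, its `threshold` parameter, and `invalidate` are hypothetical names, not part of Spark or ClearStory.

```python
# Hypothetical sketch: materialize a result only after it has been
# requested often enough, and drop it when its source data changes.
from collections import Counter
from typing import Callable, Dict


class UsageBasedCache:
    def __init__(self, compute: Callable[[str], str], threshold: int = 2):
        self.compute = compute        # expensive function: key -> result
        self.threshold = threshold    # requests needed before caching
        self.hits: Counter = Counter()
        self.store: Dict[str, str] = {}
        self.computations = 0         # how many times compute() actually ran

    def get(self, key: str) -> str:
        self.hits[key] += 1
        if key in self.store:
            return self.store[key]
        value = self.compute(key)
        self.computations += 1
        # Cache only results that have proven to be popular.
        if self.hits[key] >= self.threshold:
            self.store[key] = value
        return value

    def invalidate(self, key: str) -> None:
        # Called when the underlying source data changes.
        self.store.pop(key, None)


cache = UsageBasedCache(lambda k: k.upper(), threshold=2)
for _ in range(3):
    cache.get("weekly_donations")
print(cache.computations)
```

With a threshold of 2, the first two requests run the computation and the third is served from the cache; calling `invalidate` forces the next request to recompute.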

  11. Spark Developments – What We Like
      • Query cancellation, progress indication (0.8.1 and beyond)
      • More performance breakthroughs
      • Workload Management
      • BlinkDB
      • MLBase
      • Tachyon
      • GraphX

  12. We’re Hiring!
      • Working with the community, giving back
      • Lots of exciting new developments
      • This is like the early days of Hadoop – massive momentum gathering
      • The First Spark Summit! More Meet-ups!