
Big Data - Streaming

Big Data - Streaming. Kalapriya Kannan, IBM Research Labs, July 2013. Query = function(all data). Is there a general-purpose way to compute arbitrary functions in real time? Example query: total number of pageviews to a URL over a range of time. Implementation.


Presentation Transcript


  1. Big Data - Streaming • Kalapriya Kannan, IBM Research Labs • July 2013

  2. Query = function(all data) • Is there a general-purpose way to compute arbitrary functions in real time? • Example query: total number of pageviews to a URL over a range of time.

  3. Implementation • Running the function over all data at query time is too slow: the data is petabyte scale.
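To make "query = function(all data)" concrete, here is a minimal Python sketch of the pageview query computed by scanning every raw record. The record layout, the sample URLs, and the name `pageviews_naive` are assumptions made for illustration; they do not come from the slides. The point is only that each query costs a full pass over the data, which is what becomes impractical at petabyte scale.

```python
from datetime import datetime

# Hypothetical raw records standing in for "all data": (url, timestamp) pairs.
PAGEVIEWS = [
    ("foo.com/blog", datetime(2013, 7, 1, 10, 15)),
    ("foo.com/blog", datetime(2013, 7, 1, 11, 40)),
    ("foo.com/about", datetime(2013, 7, 1, 11, 5)),
    ("foo.com/blog", datetime(2013, 7, 1, 13, 20)),
]


def pageviews_naive(records, url, start, end):
    """Count views of `url` with start <= timestamp < end.

    Every call scans every record, so the cost grows with the size of
    the whole dataset rather than with the size of the answer.
    """
    return sum(1 for u, ts in records if u == url and start <= ts < end)


count = pageviews_naive(
    PAGEVIEWS, "foo.com/blog", datetime(2013, 7, 1, 10), datetime(2013, 7, 1, 12)
)
print(count)  # prints 2
```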

  4. Precomputation • [Diagram: instead of All Data → Query, use All Data → Precomputed Views → Query]

  5. Example • [Diagram labels: All Data (pageview, pageview, pageview, pageview), Precomputed view, 1100, Query]

  6. Precomputation • [Diagram: All Data → function → Precomputed Views → function → Query]
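The two functions in the precomputation diagram can be sketched as follows: one function turns all data into a precomputed view, and a second, much cheaper function answers the query from that view. This is a minimal illustration under assumed details: the hourly bucketing, the sample records, and the names `build_batch_view` and `query_view` are hypothetical and not taken from the slides.

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical raw records standing in for "all data": (url, timestamp) pairs.
PAGEVIEWS = [
    ("foo.com/blog", datetime(2013, 7, 1, 10, 15)),
    ("foo.com/blog", datetime(2013, 7, 1, 11, 40)),
    ("foo.com/about", datetime(2013, 7, 1, 11, 5)),
    ("foo.com/blog", datetime(2013, 7, 1, 13, 20)),
]


def hour_bucket(ts):
    """Truncate a timestamp to the start of its hour."""
    return ts.replace(minute=0, second=0, microsecond=0)


def build_batch_view(records):
    """First function: all data -> precomputed view keyed by (url, hour)."""
    view = Counter()
    for url, ts in records:
        view[(url, hour_bucket(ts))] += 1
    return view


def query_view(view, url, start, end):
    """Second function: precomputed view -> query result.

    Sums the hourly counts for hours in [start, end). Both bounds are
    assumed to be hour-aligned. The cost depends on the number of hours
    in the range, not on the number of raw records.
    """
    total = 0
    hour = start
    while hour < end:
        total += view.get((url, hour), 0)
        hour += timedelta(hours=1)
    return total


view = build_batch_view(PAGEVIEWS)
print(query_view(view, "foo.com/blog", datetime(2013, 7, 1, 10), datetime(2013, 7, 1, 12)))  # prints 2
```

The trade-off in this sketch is granularity: the view can only answer queries whose bounds fall on the bucket boundaries chosen at precomputation time.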

  7. Hadoop • Great at computing arbitrary functions • Tools for expressing these functions: Cascading, Pig, Cascalog, Scalding, Hive

  8. Hadoop precomputation • [Diagram: All Data → MapReduce workflow → Batch view #1; All Data → MapReduce workflow → Batch view #2] • Look at all • Batch-mode DB: ElephantDB

  9. Are we done? • Not quite • A batch workflow is too slow • Views are out of date • [Timeline: data up to a few hours before Now is absorbed in the batch views; just the last few hours of data are not absorbed]

  10. Last few hours of data… Storm

  11. Application queries • [Diagram: Batch view and Realtime view → Merge → Query]
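The merge step in the diagram can be sketched as a query that reads both views and combines the results. This is a minimal Python sketch under stated assumptions: both views are plain dictionaries keyed by (url, hour), hours are small integers for brevity, and the realtime view holds only data that the batch view has not yet absorbed, so the two counts can simply be added. The name `merged_pageviews` and the sample numbers are hypothetical.

```python
def merged_pageviews(batch_view, realtime_view, url, hours):
    """Answer a pageview query by merging the batch and realtime views.

    Assumes the two views cover disjoint data: the realtime view only
    counts events not yet absorbed into the batch view. If they could
    overlap, adding the counts would double count.
    """
    return sum(
        batch_view.get((url, h), 0) + realtime_view.get((url, h), 0)
        for h in hours
    )


# Hypothetical views: hours 9 and 10 are already absorbed by the batch
# workflow, hour 11 is only visible in the realtime view.
batch_view = {("foo.com/blog", 9): 40, ("foo.com/blog", 10): 60}
realtime_view = {("foo.com/blog", 11): 7}

print(merged_pageviews(batch_view, realtime_view, "foo.com/blog", range(9, 12)))  # prints 107
```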

  12. Storm concepts

  13. Storm concepts (2)

  14. What does Storm do? • Distributes code • Robust process management • Monitors topologies and reassigns failed tasks • Provides reliability by tracking tuple trees • Routing and partitioning of streams • At-least-once message processing • Horizontal scalability • No intermediate queues • Less operational overhead • "Just works" • storm jar myapp.jar com.twitter.Mytopology demo

  15. Streaming word count

  16. Word count bolts and spouts
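The figures for the two word-count slides are not part of this transcript. As a rough stand-in, the following plain-Python simulation shows the usual shape of a streaming word count: a spout emits sentences, one bolt splits them into words, and a counting bolt keeps running totals. It does not use the Storm API; every name here (`sentence_spout`, `split_bolt`, `CountBolt`, `fields_grouping`, `run_topology`) and the sample sentences are assumptions for illustration. The routing function imitates a grouping on the word field, so that the same word always reaches the same counting task.

```python
import zlib
from collections import Counter


def sentence_spout():
    """Spout: source of the stream. Here it emits two fixed sentences."""
    for sentence in ["the cow jumped over the moon", "the man went to the store"]:
        yield sentence


def split_bolt(sentence):
    """Bolt: consumes one sentence tuple and emits one tuple per word."""
    for word in sentence.split():
        yield word


class CountBolt:
    """Bolt: keeps a running count for every word routed to this task."""

    def __init__(self):
        self.counts = Counter()

    def execute(self, word):
        self.counts[word] += 1
        return word, self.counts[word]


def fields_grouping(word, num_tasks):
    """Route a word to a task by a stable hash of the word.

    crc32 is used instead of the built-in hash() so that the routing
    is the same on every run.
    """
    return zlib.crc32(word.encode("utf-8")) % num_tasks


def run_topology(num_count_tasks=2):
    """Wire spout -> split bolt -> counting tasks and drain the stream."""
    tasks = [CountBolt() for _ in range(num_count_tasks)]
    for sentence in sentence_spout():
        for word in split_bolt(sentence):
            tasks[fields_grouping(word, num_count_tasks)].execute(word)
    return tasks


tasks = run_topology()
totals = Counter()
for task in tasks:
    totals.update(task.counts)
print(totals["the"])  # prints 4
```

Because each word is routed to exactly one task, the per-task counts can be read independently or summed, as above, without any coordination between tasks.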

  17. Current use cases at Twitter • Discovery of emerging topics/stories • Online learning of tweet features for search result ranking • Realtime analytics for ads • Internal log processing • Topology isolation
