Real-Time Stream Processing - PowerPoint PPT Presentation

Presentation Transcript

  1. Real-Time Stream Processing CMSC 491/691 Hadoop-Based Distributed Computing Spring 2014 Adam Shook

  2. Agenda • Apache Storm • Apache Spark

  3. Traditional Data Processing • Diagram: batch pre-computation (a.k.a. MapReduce) runs over ALL the data, building indexes that serve queries

  4. Traditional Data Processing • Slow... and views are out of date: data arriving between batch runs is not absorbed into the batch views until the next run

  5. Compensating for the real-time stuff • Need some kind of stream processing system to supplement our batch views • Applications can then merge the batch and real-time views together!

  6. How do we do that?

  7. Twitter Storm

  8. Enter: Storm • Open-Source project originally built by Twitter • Now lives in the Apache Incubator • Enables distributed, fault-tolerant real-time computation

  9. A History Lesson on Twitter Metrics Twitter Firehose

  10. A History Lesson on Metrics Twitter Firehose

  11. Problems! • Scaling is painful • Fault-tolerance is practically non-existent • Coding for it is awful

  12. Wanted to Address • Guaranteed data processing • Horizontal Scalability • Fault-tolerance • No intermediate message brokers • Higher level abstraction than message passing • “Just works”

  13. Storm Delivers • Guaranteed data processing • Horizontal Scalability • Fault-tolerance • No intermediate message brokers • Higher level abstraction than message passing • “Just works”

  14. Use Cases • Stream Processing • Distributed RPC • Continuous Computation

  15. Storm Architecture • Diagram: a Nimbus master node coordinates, via a ZooKeeper cluster, a set of Supervisor worker nodes

  16. Glossary • Streams • Constant pump of data as Tuples • Spouts • Source of streams • Bolts • Process input streams and produce new streams • Functions, Filters, Aggregation, Joins, Talk to databases, etc. • Topologies • Network of spouts and bolts

  17. Tasks and Topologies

  18. Grouping • When a Tuple is emitted from a Spout or Bolt, where does it go? • Shuffle Grouping • Pick a random task • Fields Grouping • Consistent hashing on a subset of tuple fields • All Grouping • Send to all tasks • Global Grouping • Pick task with lowest ID
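An illustrative Python sketch (not Storm's actual routing code) of the difference between a shuffle grouping and a fields grouping; the task counts and tuple shapes are hypothetical:

```python
import random

NUM_TASKS = 4

def shuffle_grouping(num_tasks=NUM_TASKS):
    # Shuffle grouping: pick a random task, balancing load.
    return random.randrange(num_tasks)

def fields_grouping(tup, key_fields, num_tasks=NUM_TASKS):
    # Fields grouping: hash only the chosen subset of fields,
    # so tuples with equal keys always land on the same task.
    key = tuple(tup[f] for f in key_fields)
    return hash(key) % num_tasks

t1 = {"word": "storm", "count": 1}
t2 = {"word": "storm", "count": 7}
# Equal "word" values route to the same task, which is what
# makes per-key aggregation (like WordCount) correct.
assert fields_grouping(t1, ["word"]) == fields_grouping(t2, ["word"])
assert 0 <= shuffle_grouping() < NUM_TASKS
```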

  19. Topology [“id1”, “id2”] shuffle shuffle [“url”] shuffle all

  20. Guaranteed Message Processing • A tuple has not been fully processed until all tuples in the "tuple tree" have been completed • If the tree is not completed within a timeout, it is replayed • Programmers need to use the API to 'ack' a tuple as completed
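Storm tracks tuple trees compactly with an XOR trick: each tuple gets a random 64-bit ID, and the tracker XORs each ID in twice (once when the tuple is anchored into the tree, once when it is acked), so the running value reaches zero exactly when every tuple has been acked. A toy sketch of the idea:

```python
import random

class AckerEntry:
    """Toy model of per-spout-tuple ack tracking."""
    def __init__(self):
        self.val = 0  # XOR of all outstanding tuple IDs

    def anchor(self, tuple_id):
        # A new tuple joined the tree.
        self.val ^= tuple_id

    def ack(self, tuple_id):
        # A tuple finished processing.
        self.val ^= tuple_id

    def tree_complete(self):
        # Every anchored ID has also been acked iff the XOR is 0.
        return self.val == 0

entry = AckerEntry()
ids = [random.getrandbits(64) for _ in range(3)]
for i in ids:
    entry.anchor(i)
assert not entry.tree_complete()
for i in ids:
    entry.ack(i)
assert entry.tree_complete()
```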

  21. Stream Processing Example: Word Count
  TopologyBuilder builder = new TopologyBuilder();
  builder.setSpout(1, new SentenceSpout(true), 5);
  builder.setBolt(2, new SplitSentence(), 8).shuffleGrouping(1);
  builder.setBolt(3, new WordCount(), 12).fieldsGrouping(2, new Fields("word"));
  Map conf = new HashMap();
  conf.put(Config.TOPOLOGY_WORKERS, 5);
  StormSubmitter.submitTopology("word-count", conf, builder.createTopology());

  22. public static class SplitSentence extends ShellBolt implements IRichBolt {
    public SplitSentence() { super("python", "splitsentence.py"); }
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
      declarer.declare(new Fields("word"));
    }
  }

  #!/usr/bin/python
  import storm
  class SplitSentenceBolt(storm.BasicBolt):
    def process(self, tup):
      words = tup.values[0].split(" ")
      for word in words:
        storm.emit([word])

  23. public static class WordCount implements IBasicBolt {
    Map<String, Integer> counts = new HashMap<String, Integer>();
    public void prepare(Map conf, TopologyContext context) {}
    public void execute(Tuple tuple, BasicOutputCollector collector) {
      String word = tuple.getString(0);
      Integer count = counts.get(word);
      if (count == null) { count = 0; }
      ++count;
      counts.put(word, count);
      collector.emit(new Values(word, count));
    }
    public void cleanup() {}
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
      declarer.declare(new Fields("word", "count"));
    }
  }

  24. Local Mode!
  TopologyBuilder builder = new TopologyBuilder();
  builder.setSpout(1, new SentenceSpout(true), 5);
  builder.setBolt(2, new SplitSentence(), 8).shuffleGrouping(1);
  builder.setBolt(3, new WordCount(), 12).fieldsGrouping(2, new Fields("word"));
  Map conf = new HashMap();
  conf.put(Config.TOPOLOGY_WORKERS, 5);
  LocalCluster cluster = new LocalCluster();
  cluster.submitTopology("word-count", conf, builder.createTopology());
  Thread.sleep(10000);
  cluster.shutdown();

  25. Command Line Interface • Starting a topology: storm jar mycode.jar twitter.storm.MyTopology demo • Stopping a topology: storm kill demo

  26. Distributed RPC

  27. DRPC Example: Reach • Reach is the number of unique people exposed to a specific URL on Twitter • Diagram: URL → tweeters → followers → distinct followers → count → reach
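The reach computation can be sketched in plain Python with toy in-memory data (the URLs, names, and lookup tables here are hypothetical; the real DRPC topology fans these lookups out across bolts):

```python
# Who tweeted a URL, and who follows each tweeter (toy data).
tweeters_of = {"http://example.com/x": ["alice", "bob"]}
followers_of = {"alice": {"carol", "dave"}, "bob": {"dave", "erin"}}

def reach(url):
    distinct = set()
    for tweeter in tweeters_of.get(url, []):
        # Union the follower sets, so shared followers count once.
        distinct |= followers_of.get(tweeter, set())
    return len(distinct)

# dave follows both tweeters but is one unique person exposed.
assert reach("http://example.com/x") == 3
```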

  28. Reach Topology • Diagram: Spout → (shuffle) GetTweeters → (shuffle) GetFollowers → (["follower-id"]) Distinct → (global) CountAggregator

  29. Storm Review • Distributed code and configurations • Robust process management • Monitors topologies and reassigns failed tasks • Provides reliability by tracking tuple trees • Routing and partitioning of streams • Serialization • Fine-grained performance stats of topologies

  30. Apache Spark

  31. Concern! • Say I have an application that involves many iterations... • Graph Algorithms • K-Means Clustering • Six Degrees of Bieber Fever • What's wrong with Hadoop MapReduce?

  32. New Frameworks! • Researchers have developed new frameworks to keep intermediate data in-memory • Only support specific computation patterns (Map...Reduce... repeat) • No abstractions for general re-use of data

  33. Enter: RDDs • Or Resilient Distributed Datasets • Fault-tolerant parallel data structures that enable: • Persisting data in memory • Specifying partitioning schemes for optimal placement • Manipulating them with a rich set of operators

  34. Apache Spark: Lightning-Fast Cluster Computation • Open-source top-level Apache project that came out of Berkeley in 2010 • General-purpose cluster computation system • High-level APIs in Scala, Java, and Python • Higher-level tools: • Shark for HiveQL on Spark • MLlib for machine learning • GraphX for graph processing • Spark Streaming

  35. Glossary

  36. RDD Persistence and Partitioning • Persistence • Users can control which RDDs will be reused and choose a storage strategy • Partitioning • What we know and love! • Hash-partitioning based on some key for efficient joins
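A sketch of why hash-partitioning on the join key makes joins efficient: both datasets route each key to the same partition index, so matching records are already co-located and no shuffle is needed at join time. The datasets and partition count below are hypothetical:

```python
def hash_partition(pairs, num_parts):
    # Route each (key, value) pair to a partition by hashing the key.
    parts = [[] for _ in range(num_parts)]
    for key, value in pairs:
        parts[hash(key) % num_parts].append((key, value))
    return parts

ranks = [("a.com", 0.5), ("b.com", 0.3)]
links = [("a.com", ["b.com"]), ("b.com", ["a.com"])]
p_ranks = hash_partition(ranks, 4)
p_links = hash_partition(links, 4)

# "a.com" lands at the same partition index in both datasets,
# so joining on it touches only local data.
idx = hash("a.com") % 4
assert ("a.com", 0.5) in p_ranks[idx]
assert ("a.com", ["b.com"]) in p_links[idx]
```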

  37. RDD Fault-Tolerance • Replicating data in-flight is costly and hard • Instead of replicating every data set, let's just log the transformations of each data set to keep its lineage • Loss of an RDD partition can be rebuilt by replaying the transformations • Only the lost partitions need to be rebuilt!
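The recovery idea above can be sketched in a few lines of Python: rather than replicating data, record the chain of transformations and replay it over the source partition when a cached partition is lost. The log lines and transformations here are hypothetical:

```python
# Lineage: the ordered transformations that produced the lost partition.
lineage = [
    lambda rows: [r for r in rows if r.startswith("ERROR")],  # filter
    lambda rows: [r.split(" ")[1] for r in rows],             # map
]

def rebuild(source_partition, lineage):
    # Replay each transformation over the source data, in order.
    data = source_partition
    for transform in lineage:
        data = transform(data)
    return data

source = ["ERROR disk full", "INFO ok", "ERROR net down"]
# Only this lost partition is recomputed; others are untouched.
assert rebuild(source, lineage) == ["disk", "net"]
```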

  38. RDD Storage • Transformations are lazy operations • No computations occur until an action • RDDs can be persisted in-memory, but are spilled to disk if necessary • Users can specify a number of flags to persist the data • Only on disk • Partitioning schemes • Persistence priorities

  39. RDD Eviction Policy • LRU policy at an RDD level • New RDD partition is computed, but not enough space? • Evict partition from the least recently accessed RDD • Unless it is the same RDD as the one with the new partition
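A toy model of the eviction policy above: evict a partition from the least recently accessed RDD, unless that RDD is the one gaining the new partition (evicting from it would just cycle its own data). The RDD names are hypothetical:

```python
from collections import OrderedDict

class RDDCache:
    def __init__(self):
        self.lru = OrderedDict()  # rdd_id -> set of cached partition ids

    def access(self, rdd_id, partition_id):
        self.lru.setdefault(rdd_id, set()).add(partition_id)
        self.lru.move_to_end(rdd_id)  # mark RDD most recently used

    def pick_victim(self, incoming_rdd_id):
        # OrderedDict iterates oldest-first: the least recently
        # accessed RDD is the eviction candidate.
        for rdd_id in self.lru:
            if rdd_id != incoming_rdd_id:
                return rdd_id
        return None

cache = RDDCache()
cache.access("rdd-A", 0)
cache.access("rdd-B", 0)
cache.access("rdd-A", 1)  # rdd-B is now least recently used
assert cache.pick_victim("rdd-C") == "rdd-B"
assert cache.pick_victim("rdd-B") == "rdd-A"  # skip the incoming RDD
```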

  40. Example! Log Mining
  • Say you want to search through terabytes of log files stored in HDFS for errors and play around with them
  lines = spark.textFile("hdfs://...")
  errors = lines.filter(_.startsWith("ERROR"))
  errors.persist()

  41. Example! Log Mining
  // Count the number of error logs
  errors.count()
  // Count errors mentioning MySQL:
  errors.filter(_.contains("MySQL")).count()
  // Return the time fields of errors mentioning HDFS as an array
  errors.filter(_.contains("HDFS"))
        .map(_.split('\t')(3))
        .collect()

  42. Spark Execution Flow • Nothing happens to errors until an action occurs • The original HDFS file is not stored in-memory, only the final RDD • This greatly speeds up all future actions on the RDD

  43. Architecture

  44. PageRank

  45. Spark PageRank
  // Load graph as an RDD of (URL, outlinks) pairs
  val links = spark.textFile(...).map(...).persist()
  var ranks = // RDD of (URL, rank) pairs
  for (i <- 1 to ITERATIONS) {
    // Build an RDD of (targetURL, float) pairs
    // with the contributions sent by each page
    val contribs = links.join(ranks).flatMap {
      (url, (links, rank)) => links.map(dest => (dest, rank/links.size))
    }
    // Sum contributions by URL and get new ranks
    ranks = contribs.reduceByKey((x,y) => x+y)
                    .mapValues(sum => a/N + (1-a)*sum)
  }
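The same loop can be rendered in plain Python on a tiny hypothetical graph, with a = 0.15 (the usual damping constant) and N pages:

```python
# Toy link graph: page -> list of outlinks.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
N = len(links)
a = 0.15
ranks = {url: 1.0 / N for url in links}

for _ in range(20):  # ITERATIONS
    # Each page sends rank/outdegree to every page it links to.
    contribs = {}
    for url, dests in links.items():
        for dest in dests:
            contribs[dest] = contribs.get(dest, 0.0) + ranks[url] / len(dests)
    # Same update as the Scala mapValues step.
    ranks = {url: a / N + (1 - a) * contribs.get(url, 0.0) for url in links}

# Ranks stay a probability distribution since every page has outlinks.
assert abs(sum(ranks.values()) - 1.0) < 1e-6
```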

  46. Spark PageRank Lineage

  47. Tracking Lineage: Narrow vs. Wide Dependencies

  48. Scheduler DAGs

  49. Spark API • Every data set is an object, and transformations are invoked on these objects • Start with a data set, then transform it using operators like map, filter, and join • Then, do some actions like count, collect, or save

  50. Spark API