Distributed Stream Processing Strategies Presented by Ming Jiang: A Detailed Review
Presentation Transcript
Scalable Distributed Stream Processing Presented by Ming Jiang
The Distributed Setting • A distributed federation of participating nodes in different administrative domains • Collaboration between the different domains is required
Two complementary efforts address this setting • Aurora*: intra-participant distribution • Medusa: inter-participant distribution
Three pieces to be shared • Aurora • An overlay network for communication • Algorithms for high availability
Three architectural issues • Communications • Load sharing • High availability in the presence of failure
Communications • Naming (participants, entity-name) • Routing 1. A data source or an administrator registers a schema and a stream 2. When a data source (DS) produces an event, it labels …
Communications • Message transport: all message streams are multiplexed on a single TCP connection • Remote definition: used because process migration is too complicated
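One way to picture multiplexing several message streams on a single TCP connection is to prefix every message with a small header that names its stream. The sketch below is only an illustration: the frame layout (a 2-byte stream id followed by a 4-byte payload length) and the names `encode_frame` and `decode_frames` are assumptions, not the wire format used by Aurora* or Medusa.

```python
import struct

# Hypothetical frame header: stream id (unsigned 16-bit) and payload
# length (unsigned 32-bit), both in network byte order.
HEADER = struct.Struct("!HI")


def encode_frame(stream_id: int, payload: bytes) -> bytes:
    """Wrap one message so it can share a connection with other streams."""
    return HEADER.pack(stream_id, len(payload)) + payload


def decode_frames(buffer: bytes):
    """Split received bytes into complete (stream_id, payload) frames.

    Returns the decoded frames and any trailing bytes that do not yet
    form a complete frame (TCP may deliver a frame in pieces).
    """
    frames = []
    offset = 0
    while len(buffer) - offset >= HEADER.size:
        stream_id, length = HEADER.unpack_from(buffer, offset)
        end = offset + HEADER.size + length
        if end > len(buffer):
            break  # payload not fully received yet
        frames.append((stream_id, buffer[offset + HEADER.size:end]))
        offset = end
    return frames, buffer[offset:]
```

The receiver keeps the leftover bytes and prepends them to the next chunk read from the socket, so frames from different streams can be interleaved freely on the one connection.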
Load Management Repartitioning Aurora Networks, based on loads and resources: • Box Sliding • Box Splitting
Box Sliding • Takes a box on the edge of a sub-network on one machine and shifts it to the neighboring machine • Figure: upstream box sliding
Box Splitting • Creates a copy of a box that is intended to run on a second machine, to offload the first • Needs a filter that acts as a router
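Box sliding can be sketched with a toy model in which each machine holds an ordered list of boxes and the tuples leaving the first machine cross a network link. Everything here (the `run`, `run_two_machines` and `slide_upstream` helpers and the three sample boxes) is a hypothetical illustration under that assumption, not Aurora* code.

```python
def run(boxes, tuples):
    """Apply a chain of boxes, in order, to a batch of tuples."""
    for box in boxes:
        tuples = box(tuples)
    return tuples


def run_two_machines(machine1, machine2, source):
    """Run the chain split over two machines.

    Returns the final output and the number of tuples that had to
    cross the link between the machines.
    """
    crossing = run(machine1, source)
    return run(machine2, crossing), len(crossing)


def slide_upstream(machine1, machine2):
    """Shift the box on the upstream edge of machine2 onto machine1."""
    machine1.append(machine2.pop(0))


# Hypothetical boxes operating on integer tuples.
def add_one(ts):
    return [t + 1 for t in ts]


def keep_even(ts):
    return [t for t in ts if t % 2 == 0]


def square(ts):
    return [t * t for t in ts]


machine1, machine2 = [add_one], [keep_even, square]
source = [1, 2, 3, 4, 5, 6]
before = run_two_machines(machine1, machine2, source)
slide_upstream(machine1, machine2)  # the filter now runs before the link
after = run_two_machines(machine1, machine2, source)
```

In this toy run the output is the same before and after the slide, but fewer tuples cross the link once the filter sits on the upstream machine, which is one reason to slide a selective box upstream.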
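The role of the filter as a router can be sketched as follows, assuming a stateless box (one whose output for a tuple does not depend on other tuples). The helper name `split_box` and the sample box are hypothetical; the point is only that a predicate decides which copy of the box sees each tuple, and the two outputs are then combined.

```python
def split_box(box, predicate):
    """Return a function that runs `box` as two copies behind a router.

    Tuples satisfying `predicate` go to the original box (machine 1);
    all other tuples go to the copy (machine 2). The outputs are
    concatenated, so ordering across the two paths is not preserved.
    """
    def run_split(tuples):
        to_original = [t for t in tuples if predicate(t)]
        to_copy = [t for t in tuples if not predicate(t)]
        return box(to_original) + box(to_copy)
    return run_split


# Hypothetical stateless box: convert readings from Celsius to Fahrenheit.
def to_fahrenheit(ts):
    return [t * 9 / 5 + 32 for t in ts]


readings = [10, 25, 0, 40, 15]
split = split_box(to_fahrenheit, lambda t: t < 20)
```

For a stateless box the split version yields the same set of results as the original, only possibly in a different order. Stateful boxes such as aggregates need an extra merge step.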
Box Splitting • Tumble with a Merge • Box splitting has to be transparent
Box Splitting • If the predicate in the filter is B < 3: A machine: 1, 2, 3, 4, 7; B machine: 5, 6 • Figure: output of the A machine, output of the B machine, and the final result after the merge
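To make the need for a merge concrete, here is a simplified sketch of splitting a per-group count. It assumes tuples of the form (A, B), a count grouped by A, and a router predicate on B; the sample stream is invented for illustration and is not the data from the slide. Window boundaries of a real Tumble box are ignored: the sketch only shows that partial counts from the two machines have to be summed to reproduce the unsplit result.

```python
from collections import Counter


def count_by_a(tuples):
    """Count tuples per value of A (a stand-in for the aggregate box)."""
    return Counter(a for a, _b in tuples)


def split_count(tuples, predicate):
    """Route tuples to two machines, count on each, then merge."""
    machine_a = [t for t in tuples if predicate(t)]
    machine_b = [t for t in tuples if not predicate(t)]
    partial_a = count_by_a(machine_a)
    partial_b = count_by_a(machine_b)
    # Merge step: adding Counters sums the partial counts per group.
    return partial_a + partial_b


# Hypothetical stream of (A, B) tuples.
stream = [(1, 2), (1, 5), (2, 1), (2, 4), (2, 2)]
merged = split_count(stream, lambda t: t[1] < 3)
```

Without the merge, each machine would report its own partial count for the same group, and downstream boxes would see results that differ from the unsplit network. Summing the partial counts is what keeps the split transparent.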
Key partitioning Challenges • Choosing what to offload • Choosing what to split • Choosing filters • Others…
High Availability • Utilizes the push-based nature of the data flow
Failure Detection and Recovery • 1. Periodically send heartbeat messages to upstream neighbors • 2. If a server does not reply within a predefined time, assume it has failed • 3. Initiate the recovery phase, emulating the processing of the failed server (load shedding can be used)
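The detection steps above can be sketched as a small timeout-based detector. The class name `FailureDetector`, its methods, and the explicit `now` argument (used instead of a real clock so the behavior is easy to follow) are assumptions made for illustration; sending the heartbeats and running the recovery phase are left out.

```python
class FailureDetector:
    """Tracks heartbeat replies from upstream neighbors."""

    def __init__(self, neighbors, timeout, start=0.0):
        self.timeout = timeout  # predefined silence limit, in seconds
        # Treat every neighbor as having replied at start-up time.
        self.last_reply = {name: start for name in neighbors}

    def record_reply(self, neighbor, now):
        """Note that `neighbor` answered a heartbeat at time `now`."""
        self.last_reply[neighbor] = now

    def failed(self, now):
        """Return neighbors that have been silent for longer than the timeout."""
        return sorted(
            name
            for name, seen in self.last_reply.items()
            if now - seen > self.timeout
        )


detector = FailureDetector(["s1", "s2"], timeout=3.0)
detector.record_reply("s1", now=2.0)
detector.record_reply("s2", now=2.0)
detector.record_reply("s1", now=4.0)  # s2 stops replying after t = 2.0
suspects = detector.failed(now=6.0)
```

A server reported by `failed` would be the trigger for step 3. The timeout trades detection speed against false suspicions of servers that are merely slow.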