
Scalable Autonomic Streaming Middleware for Real-Time Processing of Massive Data Flows


Presentation Transcript


  1. Scalable Autonomic Streaming Middleware for Real-Time Processing of Massive Data Flows Ricardo Jimenez-Peris Universidad Politecnica de Madrid Project Coordinator

  2. Project Data • Start: February 2008. • Duration: 3 years. • Partners: • UPM – Spain (coord.). • FORTH - Greece. • TU Dresden - Germany. • Telefonica - Spain. • Exodus - Greece. • Epsilon - Italy.

  3. Background • Data streaming is a new paradigm developed in the database community to process large data flows in memory in an online fashion. • It allows continuous queries to be performed over flowing data. • Most existing platforms are centralized, a few are distributed, and they perform 1-2 orders of magnitude better than relational DBs.
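
As an illustration of what a continuous query over flowing data can look like, here is a minimal sketch in Python. It is not the project's code: the class name `SlidingWindowCount`, the time-based window, and the assumption that tuples arrive in timestamp order are all hypothetical choices made for this example. The query state is kept entirely in memory and the result is updated on every arriving tuple.

```python
from collections import deque


class SlidingWindowCount:
    """Hypothetical continuous query: events per key over the last `window` time units."""

    def __init__(self, window):
        self.window = window
        self.events = deque()  # (timestamp, key) pairs in arrival order
        self.counts = {}       # current per-key count inside the window

    def push(self, timestamp, key):
        """Consume one tuple and return the up-to-date count for its key.

        Assumes timestamps are non-decreasing (an assumption of this sketch).
        """
        self.events.append((timestamp, key))
        self.counts[key] = self.counts.get(key, 0) + 1
        # Evict tuples that have fallen out of the window.
        while self.events and self.events[0][0] <= timestamp - self.window:
            _, old_key = self.events.popleft()
            self.counts[old_key] -= 1
            if self.counts[old_key] == 0:
                del self.counts[old_key]
        return self.counts[key]
```

Each call to `push` both ingests a tuple and produces a fresh answer, which is the difference from a relational query that runs once over stored data.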

  4. Background: Data Streaming Operators

  5. Background: Data Streaming Query

  6. Scope • Many potential applications on the Internet today require processing huge amounts of information in an online fashion: • Mitigation of DDoS attacks. • Spam filtering. • Processing the output of sensor networks. • Detecting fraud in cellular telephony. • Financial applications. • QoS monitoring for enforcing SLAs. • Real-time data mining. • Etc.

  7. Objectives • Stream aims at developing a highly scalable middleware infrastructure to process massive data flows in real time. • The innovation lies in the sheer scale targeted by the project: 1-2 orders of magnitude higher than current technology.

  8. Innovation • Parallelizing data streaming operators: • Currently a query operator can be deployed only on a single site, where it has to process the full data flow and thus becomes the bottleneck. • Stream is developing distributed versions of query operators that allow an individual query operator to run across a cluster of sites.
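
One common way to spread a single stateful operator over several sites is to partition the stream by key, so that each instance sees only part of the flow. The sketch below assumes that technique; the slides do not say how Stream parallelizes its operators, and the names `CountInstance` and `ParallelCount` are hypothetical. Instances are plain in-process objects here, standing in for cluster sites.

```python
import zlib


class CountInstance:
    """One instance of a per-key counting operator (stands in for one site)."""

    def __init__(self):
        self.counts = {}

    def process(self, key):
        self.counts[key] = self.counts.get(key, 0) + 1


class ParallelCount:
    """Hypothetical parallel operator: routes each tuple to one instance by key hash."""

    def __init__(self, n_instances):
        self.instances = [CountInstance() for _ in range(n_instances)]

    def route(self, key):
        # crc32 is stable across runs, unlike Python's built-in hash() for strings.
        return zlib.crc32(key.encode("utf-8")) % len(self.instances)

    def process(self, key):
        self.instances[self.route(key)].process(key)

    def result(self):
        # Every key lives on exactly one instance, so merging is a plain union.
        merged = {}
        for instance in self.instances:
            merged.update(instance.counts)
        return merged
```

Because all tuples with the same key reach the same instance, no instance has to process the full data flow, and the per-key state never needs to be reconciled across instances.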

  9. Innovation: Parallel Data Streaming [Figure: diagram of operators Op1, Op2, and Op3 connected by upstream/downstream links.]

  10. Innovation • Exploiting leading-edge high-performance networks and IO systems: • Reaching 40 Gbps for both networking and IO. • This results in high-throughput communication among sites and very low latency. • Low-cost storage system: • 1 PC controlling 40 disks.

  11. Architecture • Autonomic Controller Layer. • Data Mining Layer. • Parallel Data Streaming Layer. • Data Streaming Layer. • High Performance IO & Storage Layer.

  12. Innovation • Self-healing: • Able to tolerate failures (novel approach). • Able to recover new nodes online. • Self-configuring: • Dynamic load balancing. • Self-provisioning: • Nodes are added and removed as needed depending on the load.
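
The self-provisioning idea, adding and removing nodes depending on the load, can be sketched as a simple threshold rule. This is an illustrative assumption only: the slides do not describe the controller's policy, and the function name `provision`, the utilization thresholds, and the one-node-at-a-time step are all hypothetical.

```python
def provision(nodes, total_load, capacity_per_node, low=0.3, high=0.8):
    """Return the node count for the next period (hypothetical threshold policy).

    Utilization is the offered load divided by the cluster's total capacity.
    The `low` and `high` thresholds are illustrative values, not project settings.
    """
    utilization = total_load / (nodes * capacity_per_node)
    if utilization > high:
        return nodes + 1  # overloaded: add a node
    if utilization < low and nodes > 1:
        return nodes - 1  # underused: release a node, but keep at least one
    return nodes          # within the target band: no change
```

Keeping a gap between the two thresholds avoids adding and removing the same node repeatedly when the load hovers near a single cut-off.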

  13. Expected Outcome • Highly scalable and autonomic infrastructure to process massive data flows. • 2 orders of magnitude more scalable than current distributed data streaming platforms. • Application to 3 different markets: • Telco: Fighting fraud in cellular telephony. • Services: Real-time checking of SLA fulfillment. • Financial/banking: Detection of money-laundering operations / Fraud detection in credit card payments / Real-time data warehousing.

  14. Current Status • Month 8 of the project. • Prototypes of all layers (except the autonomic controller, foreseen for the 2nd year). • A 50-node cluster interconnected with Myrinet 10G has been set up. • First tests of parallel data streaming exhibit high scalability. • Prototypes of the IO and storage tiers are in an advanced state.

  15. Questions?
