InfoSphere Streams

Tushar Kale, Big Data Evangelist and Streams Architect (tusharkale@tusharkale.com). Agenda: Overview, Architecture, Customer Use Cases. Big Data = Variety, Velocity, and Volume.

Presentation Transcript


  1. InfoSphere Streams
     Tushar Kale, Big Data Evangelist and Streams Architect
     tusharkale@tusharkale.com

  2. Agenda • Overview • Architecture • Customer Use Cases

  3. Big Data = Variety, Velocity, and Volume
     Extracting insight from an immense volume, variety and velocity of data, in context, beyond what was previously possible.

  4. InfoSphere Streams: A Platform to Run In-Motion Analytics on BIG Data
     • Real-time delivery
     • Handles up to petabytes of data per day
     • Supports traditional as well as non-traditional data (audio, video, etc.)
     • Delivers insights with microsecond latencies
     • Supports custom analytics written in C++/Java and warehouse analytic models
     • A single instance can support multiple applications
     Diagram labels:
     • Dimensions: Volume, Variety, Velocity, Agility
     • Characteristics: powerful analytics, millions of events per second, microsecond latency, complex analytics, traditional / non-traditional data sources
     • Example applications: environment monitoring, ICU monitoring, telco churn prediction, algo trading, smart grid, cyber security, government / law enforcement

  5. Stream Computing Illustrated
     Figure: a stream is a sequence of tuples. Tuples shown:
     • directory:"/img" filename:"farm"
     • directory:"/img" filename:"bird"
     • directory:"/opt" filename:"java"
     • directory:"/img" filename:"cat"
     • height:640 width:480 data:
     • height:1280 width:1024 data:
     • height:640 width:480 data:

  6. What can Streams do for you?
     • Analyze and react to events as they are happening
     • Take advantage of more sources of data in "true" real time
     • Build models on your most up-to-the-second information that will help predict what happens next
     • Streams is both middleware and a language for building and running analytic applications that operate on data in motion
     • Scale: easily handles anything from a few events per second to multiple millions of events per second
     • Reaction time: actionable results are possible in much less than a second (< 20 microseconds possible)
     • Enables TRUE situational awareness

  7. BIG Data: Extending the Warehouse
     • Database & Warehouse: traditional / relational data sources; at-rest data analytics; results
     • InfoSphere Streams: non-traditional / non-relational data sources; in-motion analytics; ultra-low-latency results
     • InfoSphere BigInsights: traditional / relational and non-traditional / non-relational data sources; Internet-scale data analytics, data operations and model building; results

  8. Adaptive Analytics: Integrating Analytics on Data in Motion and Data at Rest
     • InfoSphere Streams: data ingest, preparation, online analysis, model validation
     • InfoSphere BigInsights, Database & Warehouse: data integration, data mining, machine learning, statistical modeling
     • Visualization of real-time and historical insights
     • Numbered steps in the diagram: 1. Data Ingest; 2. Bootstrap/Enrich; 3. Adaptive Analytics Model
     • Other diagram labels: Data, Control flow

  9. Agenda • Overview • Architecture • Customer Use Cases

  10. What are key differentiating technical capabilities of Streams?
     • Language built for streaming applications: reusable operators; rapid application development; continuous "pipeline" processing
     • Performance and scaling: operator fusing and threading; efficient use of cores; distributed execution; very fast data exchange
     • Use the data that gives you a competitive advantage: can handle virtually any data type; use data that is too expensive and time sensitive for traditional approaches
     • Easy to extend: built-in adapters; users add capability with familiar C++ and Java
     • Dynamic analysis: programmatically change topology at runtime; create new subscriptions; create new port properties
     • Flexible and high-performance transport: very low latency; high data rates
     • Easy to manage: automatic placement; extend applications incrementally without downtime; multi-user / multiple applications

  11. Front Office 3.0: InfoSphere Streams
     • Streams Processing Language and IDE: Streams Studio, an Eclipse IDE for SPL
     • Runtime Environment: highly scalable stream processing runtime; supported on x86 hardware, Red Hat Enterprise Linux Version 5 (5.3 and up)
     • Tools and Technology Integration: Streams Console and monitoring, built-in stream relational analytics, adapters, toolkits

  12. Terminology
     Figure: operator instance O1 (MySrc) produces stream A, which feeds operator instances O2 (MySink) and O3 (MySink) through stream connections.
     SPL for the figure:
         (stream<Type> A) as O1 = MySrc() {}
         () as O2 = MySink(A) {}
         () as O3 = MySink(A) {}
     • Application: data flow graph of operator instances connected to each other via stream connections
     • Operator: reusable stream analytic
       • Input ports receive data; output ports produce data
       • Source: no input ports; Sink: no output ports
     • Operator instance: a specific instantiation of an operator
     • Stream: continuous series of tuples, generated by an operator instance's output port
     • Stream connection: a stream connected to a specific operator instance input port
     • PE: a runtime process that executes a set of operator instances
     • Job: an application instance running on a set of hosts
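
The three-operator graph on this slide can be mimicked in plain Python to make the terminology concrete. This is only an illustrative sketch, not SPL and not the Streams API: `my_src`, `MySink`, and `run` are hypothetical names, and modeling a tuple as a `dict` is an assumption.

```python
from typing import Iterable, Iterator, List


def my_src(n: int) -> Iterator[dict]:
    """Source operator instance O1: no input ports, one output port (stream A)."""
    for i in range(n):
        yield {"seq": i}  # a tuple, modeled here as a dict of named attributes


class MySink:
    """Sink operator: one input port, no output ports."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.received: List[dict] = []

    def process(self, tup: dict) -> None:
        self.received.append(tup)


def run(stream: Iterable[dict], sinks: List[MySink]) -> None:
    """Deliver every tuple of one stream over each of its stream connections."""
    for tup in stream:
        for sink in sinks:
            sink.process(tup)


# Two instances of the same operator (O2, O3) both consume stream A.
o2, o3 = MySink("O2"), MySink("O3")
run(my_src(3), [o2, o3])
```

Both sinks see the same three tuples, which mirrors one stream fanning out over two stream connections.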

  13. InfoSphere Streams Programming Model
     • Application programming (SPL)
     • Source adapters, operator repository, sink adapters
     • Platform-optimized compilation

  14. Streams Core Analytical Capabilities: Streams Built-in Relational and Utility Operators
     • The Split operator divides incoming tuples into separate streams for parallel processing
     • The Aggregate operator groups and summarizes incoming tuples
     • The Delay operator "artificially" slows down a stream
     • The Functor operator performs tuple-level manipulations
     • The Punctor operator inserts punctuation marks into streams
     • The Join operator correlates two streams
     • The Barrier operator serves as a synchronization point
     • The Sort operator imposes an order on incoming tuples in a stream
     • And more!
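
To give a feel for what three of these operators do, here is a rough Python analogy built on generators. It is a sketch under assumptions, not the SPL operators themselves: the function names, the count-based tumbling window, and the sample sensor readings are all invented for illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List


def functor(stream: Iterable[dict], keep: Callable[[dict], bool],
            transform: Callable[[dict], dict]) -> Iterator[dict]:
    """Functor-like step: filter, then rewrite, one tuple at a time."""
    for tup in stream:
        if keep(tup):
            yield transform(tup)


def split(stream: Iterable[dict], route: Callable[[dict], int],
          n_outputs: int) -> List[List[dict]]:
    """Split-like step: send each tuple to the output chosen by `route`."""
    outputs: List[List[dict]] = [[] for _ in range(n_outputs)]
    for tup in stream:
        outputs[route(tup) % n_outputs].append(tup)
    return outputs


def aggregate(stream: Iterable[dict], key: str, value: str,
              size: int) -> Iterator[Dict[str, float]]:
    """Aggregate-like step: per-key sums over tumbling windows of `size` tuples.

    A trailing partial window is dropped in this sketch.
    """
    sums: Dict[str, float] = defaultdict(float)
    count = 0
    for tup in stream:
        sums[tup[key]] += tup[value]
        count += 1
        if count == size:
            yield dict(sums)  # copy before the window is reset
            sums.clear()
            count = 0


readings = [
    {"sensor": "a", "temp": 20.0},
    {"sensor": "b", "temp": 30.0},
    {"sensor": "a", "temp": -999.0},  # invalid reading, filtered out below
    {"sensor": "a", "temp": 22.0},
]
valid = list(functor(readings,
                     keep=lambda t: t["temp"] > -100.0,
                     transform=lambda t: {**t, "hot": t["temp"] > 25.0}))
by_sensor = split(valid, route=lambda t: 0 if t["sensor"] == "a" else 1, n_outputs=2)
window_sums = list(aggregate(valid, key="sensor", value="temp", size=3))
```

Generators keep the "continuous pipeline" flavor: each tuple is handled as it arrives rather than after the whole input is collected.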

  15. Streams Core Adapter Capabilities: Streams Built-in Adapters and DB Toolkit
     • The ODBCSource operator reads data from databases such as DB2, IDS, and Oracle
     • The ODBCAppend operator writes data to databases such as DB2, IDS, and Oracle
     • The ODBCEnrich operator extends streaming data based on lookups in database tables
     • The solidDBEnrich operator extends streaming data based on lookups in in-memory database tables
     • The FileSource operator reads data from files in formats such as csv, line, or binary
     • The FileSink operator writes data to files in formats such as csv, line, or binary
     • The TCPSource / UDPSource operators read data from sockets in formats such as csv, line, or binary
     • The TCPSink / UDPSink operators write data to sockets in formats such as csv, line, or binary

  16. Extensibility
     • User-defined operators that extend the language
       • A reusable, generic operator model
       • Written in general-purpose programming languages (C++/Java)
     • User-defined functions that extend the language
     • Toolkits: sets of domain-specific operators/functions
       • Toolkits available as part of Streams: DB toolkit, data mining toolkit, financial toolkit
     • Streams Exchange on developerWorks
       • Reusable assets and forum
     • Developers fall into two categories
       • Application developers
       • Toolkit developers

  17. Static vs. Dynamic Composition
     • Static connections
       • Fully specified at application development time and do not change at run time
     • Dynamic connections
       • Partially specified at application development time (name or properties)
       • Established at run time, as new jobs come and go
       • Specifications can also be updated at run time
     • Dynamic application composition
       • Incremental deployment of applications
       • Dynamic adaptation of applications
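
The idea of connections that are specified by properties and resolved at run time can be sketched as a small matching registry. This is a hypothetical Python model, not the Streams mechanism: `Registry`, the job and stream names, and exact-match property comparison are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Set, Tuple


@dataclass
class Registry:
    # exported stream name -> properties advertised by the exporting job
    exports: Dict[str, Dict[str, str]] = field(default_factory=dict)
    # importing job name -> properties it subscribes to
    imports: Dict[str, Dict[str, str]] = field(default_factory=dict)

    def connections(self) -> Set[Tuple[str, str]]:
        """Recompute (stream, importer) pairs whose properties currently match."""
        return {
            (stream, importer)
            for stream, props in self.exports.items()
            for importer, wanted in self.imports.items()
            if all(props.get(k) == v for k, v in wanted.items())
        }


reg = Registry()
reg.imports["alerting"] = {"kind": "trades"}   # subscriber job is running first
before_submit = reg.connections()              # nothing to connect to yet

reg.exports["nyse_feed"] = {"kind": "trades", "venue": "NYSE"}  # new job arrives
after_submit = reg.connections()

reg.imports["alerting"] = {"kind": "quotes"}   # subscription updated at run time
after_update = reg.connections()
```

Because the match is recomputed from current state, a connection appears when a matching job is submitted and disappears when the subscription changes, with neither job being redeployed.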

  19. InfoSphere Streams Runtime Architecture
     Figure labels:
     • Tools: Eclipse IDE and management tools; language / optimizing compiler; admin config / console; management APIs
     • InfoSphere Streams runtime running on a cluster (125 blades)
     • Running anywhere inside the cluster: streamtool
     • Components running on management hosts: Streams Web Service, Name Service, Partition Service, Root Service, Scheduler, Authorization and Authentication Service, Streams Application Manager, Streams Resource Manager
     • Components running on application hosts: Processing Element Container (a subset of an SPL application, that is, a collection of operators), Agent, Host Controller

  20. InfoSphere Streams Runtime
     • Streams is a distributed, multi-user, multi-instance system
       • Multiple instances can run at the same time
       • Can run jobs from multiple users
       • A security model is provided for authentication and authorization
     • Application management
       • New jobs can be added/removed at any time
       • New and existing jobs can connect to each other
       • Scheduler assigns PEs to hosts based on load
     • Resource management
       • Hosts and services configuration and state
       • System and application metrics
     • Failure semantics
       • Recovery of management services state
       • PEs can be restarted or relocated upon failure
       • All connections will be re-established once a PE restarts
       • All state and in-transit tuples are lost
       • Checkpointing can be used to restore operator state
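
Load-based placement and relocation after a host failure can be illustrated with a small greedy scheduler. The slide does not describe the actual Streams scheduling algorithm, so everything here is an assumption: the greedy heaviest-first policy, the per-PE cost numbers, and the names `place` and `relocate`.

```python
from typing import Dict, List, Optional


def place(pes: Dict[str, float], hosts: List[str],
          placement: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Assign each unplaced PE (heaviest first) to the least-loaded host."""
    placement = dict(placement or {})
    load = {h: 0.0 for h in hosts}
    for pe, host in placement.items():
        load[host] += pes[pe]  # account for PEs that are already running
    pending = [pe for pe in pes if pe not in placement]
    for pe in sorted(pending, key=lambda p: (-pes[p], p)):
        target = min(hosts, key=lambda h: (load[h], h))  # ties broken by name
        placement[pe] = target
        load[target] += pes[pe]
    return placement


def relocate(pes: Dict[str, float], hosts: List[str],
             placement: Dict[str, str], failed_host: str) -> Dict[str, str]:
    """Keep PEs on surviving hosts and re-place those from the failed host."""
    survivors = [h for h in hosts if h != failed_host]
    kept = {pe: h for pe, h in placement.items() if h != failed_host}
    return place(pes, survivors, kept)


pe_cost = {"pe1": 4.0, "pe2": 3.0, "pe3": 2.0, "pe4": 1.0}  # hypothetical load units
cluster = ["hostA", "hostB", "hostC"]
initial = place(pe_cost, cluster)
after_failure = relocate(pe_cost, cluster, initial, failed_host="hostC")
```

The sketch only moves PEs; as the slide notes, operator state and in-transit tuples would be lost on restart unless checkpointing is used.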

  24. InfoSphere Streams Runtime (cont'd)
     • Runs on commodity hardware
     • From single node to blade centers to high-performance multi-rack clusters
     • Adapts to changes:
       • In workloads
       • In resources
     Figure: five x86 hosts.


  26. Streams Studio Eclipse IDE

  27. Streams Console – Metrics

  28. Agenda • Overview • Architecture • Customer Use Cases

  29. Streaming Analytics in Action
     • Law Enforcement, Defense & Cyber Security: real-time multimodal surveillance; situational awareness; cyber security detection
     • Stock Market: impact of weather on securities prices; analyze market data at ultra-low latencies
     • Natural Systems: wildfire management; water management
     • Transportation: intelligent traffic management
     • Fraud Prevention: detecting multi-party fraud; real-time fraud prevention
     • Manufacturing: process control for microchip fabrication
     • e-Science: space weather prediction; detection of transient events; synchrotron atomic research
     • Health & Life Sciences: neonatal ICU monitoring; epidemic early warning system; remote healthcare monitoring
     • Telephony: CDR processing; social analysis; churn prediction; geomapping
     • Other: smart grid; text analysis; who's talking to whom?; ERP for commodities; FPGA acceleration

  30. Smarter, Faster, Cheaper CDR Processing
     • 6 billion CDRs (call detail records) per day
     • Deduplication over a 7-day window
     • Processing latency reduced from 12 hours to a few seconds
     • 6 machines (using ½ of processor capacity)
     • InfoSphere Streams xDR Hub
     • Key requirements: price/performance and scaling
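
Deduplicating records over a sliding 7-day window can be sketched as follows. This is a minimal in-memory illustration, not the deployed xDR Hub design: the class name, the use of a record ID as the dedup key, and the assumption that timestamps arrive in non-decreasing order are all hypothetical.

```python
from collections import OrderedDict
from typing import Hashable

SECONDS_PER_DAY = 86_400


class WindowedDeduplicator:
    """Remember each key for `window_seconds` after its first sighting."""

    def __init__(self, window_seconds: int) -> None:
        self.window = window_seconds
        # Insertion order equals time order because timestamps never decrease.
        self._first_seen: "OrderedDict[Hashable, int]" = OrderedDict()

    def is_new(self, key: Hashable, now: int) -> bool:
        # Evict keys whose first sighting has fallen out of the window.
        while self._first_seen:
            _, seen_at = next(iter(self._first_seen.items()))
            if now - seen_at < self.window:
                break
            self._first_seen.popitem(last=False)
        if key in self._first_seen:
            return False  # duplicate within the window
        self._first_seen[key] = now
        return True


dedup = WindowedDeduplicator(window_seconds=7 * SECONDS_PER_DAY)
first = dedup.is_new("cdr-1", now=0)
repeat = dedup.is_new("cdr-1", now=3_600)
after_expiry = dedup.is_new("cdr-1", now=7 * SECONDS_PER_DAY)

# Average arrival rate implied by the slide: 6 billion CDRs per day.
average_cdrs_per_second = 6_000_000_000 / SECONDS_PER_DAY
```

At the slide's volume the average rate is roughly 69,000 CDRs per second, and a 7-day window spans about 42 billion records, so a real deployment would need a far more compact key store than this dictionary.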

  31. Telco: Beyond CDR processing, building on existing insight
     Diagram labels:
     • Platforms: InfoSphere Streams; Database & Warehouse
     • Inputs: mobile network, customer interactions, social media, weather, business rules
     • Analytics: call quality analytics, call data analytics, churn analytics, network analytics, campaign analytics, audio analytics, location analytics, social analytics, … analytics

  32. Surveillance and Physical Security: TerraEchos (Business Partner)
     • Use scenario
       • State-of-the-art covert surveillance system based on the Streams platform
       • Acoustic signals from buried fiber optic cables are monitored, analyzed and reported in real time for necessary action
       • Currently designed to scale up to 1600 streams of raw binary data
     • Requirement
       • Real-time processing of multi-modal signals (acoustics, video, etc.)
       • Easy to expand, dynamic
       • 3.5M data elements per second
     • Winner, 2010 IBM CTO Innovation Award

  33. Cyber Security Analytics
     Diagram labels: IT I/S; firewalls; live packet capture; InfoSphere Streams (five Processing Element Containers); remediation infrastructure / ticketing
     • DNS / DHCP / Netflow sources
     • Botnet behavior modeling
     • External C&C feeds (live DB queries)
     • Botnet nodes / malware
     • IP/MAC identifying suspects

  34. University of Ontario Institute of Technology (UOIT) and Sick Kids Hospital
     IBM Data Baby: http://youtu.be/ZiqY7p1v950

  35. Intelligent Transportation
     • Multimodal data streams
       • GPS
       • Counts, speeds, travel times
       • Public transport
       • Pollution measurements
       • Weather conditions
     • Archiving of cleansed data
     • Real-time traffic monitoring
     • Real-time traffic information
     • (Multimodal) travel planner
     Only 4 x86 blade servers are needed to process 250,000 GPS probes per second.
     Diagram labels:
     • Input: GPS data streams
     • Processing: real-time transformation logic; real-time geo mapping; real-time speed and heading estimation; real-time aggregates and statistics
     • Outputs: storage adapters; data warehouse; offline statistical analysis; web server; interactive visualization; Google Earth
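
The "speed and heading estimation" step can be illustrated with two consecutive GPS probes. The slide does not say how the deployed system computes these values; this sketch assumes a spherical Earth, the haversine distance, and the initial great-circle bearing, and all function names are hypothetical.

```python
import math
from typing import Tuple

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius, spherical approximation

Probe = Tuple[float, float, float]  # (timestamp in seconds, latitude, longitude)


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in meters between two points given in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.atan2(math.sqrt(a), math.sqrt(1 - a))


def heading_deg(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Initial bearing in degrees clockwise from north, in [0, 360)."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlmb = math.radians(lon2 - lon1)
    y = math.sin(dlmb) * math.cos(p2)
    x = math.cos(p1) * math.sin(p2) - math.sin(p1) * math.cos(p2) * math.cos(dlmb)
    return math.degrees(math.atan2(y, x)) % 360.0


def speed_and_heading(prev: Probe, curr: Probe) -> Tuple[float, float]:
    """Return (speed in m/s, heading in degrees) between two probes."""
    dt = curr[0] - prev[0]
    if dt <= 0:
        raise ValueError("probes must be in strictly increasing time order")
    dist = haversine_m(prev[1], prev[2], curr[1], curr[2])
    return dist / dt, heading_deg(prev[1], prev[2], curr[1], curr[2])


# A vehicle moving due east along the equator for 10 seconds.
speed, heading = speed_and_heading((0.0, 0.0, 0.0), (10.0, 0.0, 0.001))

# Even split of the slide's load across its 4 blade servers.
probes_per_blade = 250_000 / 4
```

Spread evenly, the slide's figures work out to 62,500 probes per second per blade, so the per-probe computation has to stay this lightweight.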

  36. THINK

  37. Questions?
