stream based data management l.
Download
Skip this Video
Loading SlideShow in 5 Seconds..
Stream-based Data Management PowerPoint Presentation
Download Presentation
Stream-based Data Management

Loading in 2 Seconds...

play fullscreen
1 / 29

Stream-based Data Management - PowerPoint PPT Presentation


  • 216 Views
  • Uploaded on

Stream-based Data Management. IS698 Min Song. Characteristics of Data Streams. Data Streams Data streams — continuous, ordered, changing, fast, huge amount Traditional DBMS — data stored in finite, persistent data sets Characteristics Huge volumes of continuous data, possibly infinite

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about 'Stream-based Data Management' - van


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
characteristics of data streams
Characteristics of Data Streams
  • Data Streams
    • Data streams—continuous, ordered, changing, fast, huge amount
    • Traditional DBMS—data stored in finite, persistentdata sets
  • Characteristics
    • Huge volumes of continuous data, possibly infinite
    • Fast changing and requires fast, real-time response
    • Data stream captures nicely our data processing needs of today
    • Random access is expensive—single linear scan algorithm (can only have one look)
    • Store only the summary of the data seen thus far
    • Most stream data are at pretty low-level or multi-dimensional in nature, needs multi-level and multi-dimensional processing
stream data applications
Stream Data Applications
  • Telecommunication calling records
  • Business: credit card transaction flows
  • Network monitoring and traffic engineering
  • Financial market: stock exchange
  • Engineering & industrial processes: power supply & manufacturing
  • Sensor, monitoring & surveillance: video streams
  • Security monitoring
  • Web logs and Web page click streams
  • Massive data sets (even saved but random access is too expensive)
data streams vs data sets
Data Sets:Data Streams:

Updates infrequent

Data Streams vs. Data Sets
  • Data changed constantly (sometimes additions only)
  • Old data required many times
  • Mostly only freshest data used
  • Example: employees personal data table
  • Examples: financial tickers, data feeds from sensors, network monitoring, etc
using traditional database

Query

Result

Query

Result

Using Traditional Database

User/Application

Loader

data streams paradigm

Register Query

Result

Data Streams Paradigm

User/Application

Stream Query

Processor

data streams paradigm7

Register Query

Data

Stream

Management

System

(DSMS)

Scratch Space

(Memory and/or Disk)

Data Streams Paradigm

User/Application

Result

Stream Query

Processor

dbms versus dsms
Persistent relations

Transient streams (and persistent relations)

DBMS versus DSMS
dbms versus dsms9
Persistent relations

Transient streams (and persistent relations)

DBMS versus DSMS
  • One-time queries
  • Continuous queries
dbms versus dsms10
Persistent relations

Transient streams (and persistent relations)

DBMS versus DSMS
  • One-time queries
  • Continuous queries
  • Random access
  • Sequential access
dbms versus dsms11
Persistent relations

Transient streams (and persistent relations)

DBMS versus DSMS
  • One-time queries
  • Continuous queries
  • Random access
  • Sequential access
  • Access plan determined by query processor and physical DB design
  • Unpredictable data arrival and characteristics
dbms versus dsms12
Persistent relations

Transient streams (and persistent relations)

DBMS versus DSMS
  • One-time queries
  • Continuous queries
  • Random access
  • Sequential access
  • Access plan determined by query processor and physical DB design
  • Unpredictable data arrival and characteristics
  • “Unbounded” disk store
  • Bounded main memory
challenges of stream data processing
Challenges of Stream Data Processing
  • Multiple, continuous, rapid, time-varying, ordered streams
  • Main memory computations
  • Queries are often continuous
    • Evaluated continuously as stream data arrives
    • Answer updated over time
  • Queries are often complex
    • Beyond element-at-a-time processing
    • Beyond relational queries (scientific, data mining, OLAP)
  • Multi-level/multi-dimensional processing and data mining
    • Most stream data are at pretty low-level or multi-dimensional in nature
processing stream queries
Processing Stream Queries
  • Query types
    • One-time query vs. continuous query (being evaluated continuously as stream continues to arrive)
    • Predefined query vs. ad-hoc query (issued on-line)
  • Unbounded memory requirements
    • For real-time response, main memory algorithm should be used
    • Memory requirement is unbounded if one will join future tuples
  • Approximate query answering
    • With bounded memory, it is not always possible to produce exact answers
    • High-quality approximate answers are desired
    • Data reduction and synopsis construction methods
      • Sketches, random sampling, histograms, wavelets, etc.
methods for approximate query answering
Methods for Approximate Query Answering
  • Sliding windows
    • Only over sliding windows of recent stream data
    • Approximation but often more desirable in applications
  • Batched processing, sampling and synopses
    • Batched if update is fast but computing is slow
      • Compute periodically, not very timely
    • Sampling if update is slow but computing is fast
      • Compute using sample data, but not good for joins, etc.
    • Synopsis data structures
      • Maintain a small synopsis or sketch of data
      • Good for querying historical data
  • Blocking operators, e.g., sorting, avg, min, etc.
    • Blocking if unable to produce the first output until seeing the entire input
projects on dsms data stream management system
Projects on DSMS (Data Stream Management System)
  • Research projects and system prototypes
    • STREAM (Stanford): A general-purpose DSMS
    • Cougar (Cornell): sensors
    • Aurora(Brown/MIT): sensor monitoring, dataflow
    • Hancock (AT&T): telecom streams
    • Niagara (OGI/Wisconsin): Internet XML databases
    • OpenCQ(Georgia Tech): triggers, incr. view maintenance
    • Tapestry (Xerox): pub/sub content-based filtering
    • Telegraph(Berkeley): adaptive engine for sensors
    • Tradebot(www.tradebot.com): stock tickers & streams
    • Tribeca (Bellcore): network monitoring
    • Streaminer (UIUC): new project for stream data mining
stream data mining vs stream querying
Stream Data Mining vs. Stream Querying
  • Stream mining—A more challenging task
    • It shares most of the difficulties with stream querying
    • Patterns are hidden and more general than querying
    • It may require exploratory analysis
      • Not necessarily continuous queries
  • Stream data mining tasks
    • Multi-dimensional on-line analysis of streams
    • Mining outliers and unusual patterns in stream data
    • Clustering data streams
    • Classification of stream data
challenges for mining unusual patterns in data streams
Challenges for Mining Unusual Patterns in Data Streams
  • Most stream data are at pretty low-level or multi-dimensional in nature: needs ML/MD processing
  • Analysis requirements
    • Multi-dimensional trends and unusual patterns
    • Capturing important changes at multi-dimensions/levels
    • Fast, real-time detection and response
summary
Summary
  • Stream data analysis: A rich and largely unexplored field
    • Current research focus in database community: DSMS system architecture, continuous query processing, supporting mechanisms
    • Stream data mining and stream OLAP analysis
      • Powerful tools for finding general and unusual patterns
      • Largely unexplored: current studies only touched the surface
  • Lots of exciting issues in further study
    • A promising one: Multi-level, multi-dimensional analysis and mining of stream data
what is a continuous query
What Is A Continuous Query ?

Query which is issued once and logically run continuously.

what is continuous query
What is Continuous Query ?

Query which is issued once and run continuously.

Example: detect abnormalities in network traffic behavior in real-time and their cause -- like link congestion due to hardware failure.

what is continuous query22
What is Continuous Query ?

Query which is issued once and run continuously.

More examples:

Continues queries used to support load balancing, online automatic trading at Stock Exchange

special challenges
Special Challenges
  • Timely online answers even for rapid data streams
  • Ability of fast access to large portions of data
  • Processing of multiple streams simultaneously
making things concrete

Central

Office

Central

Office

DSMS

Making Things Concrete

BOB

ALICE

Outgoing (call_ID, caller, time, event)

Incoming (call_ID, callee, time, event)

event = startorend

making things concrete25
Making Things Concrete
  • Database = two streams of mobile call records
    • Outgoing(connectionID, caller, start, end)
    • Incoming(connectionID, callee, start, end)
  • Query language = SQL

FROM clauses can refer to streams and/or relations

query 1 self join
Query 1 (self-join)

Find alloutgoing callslonger than2 minutes

SELECT O1.call_ID, O1.caller

FROM Outgoing O1, Outgoing O2

WHERE (O2.time – O1.time > 2

AND O1.call_ID = O2.call_ID

AND O1.event = start

AND O2.event = end)

  • Result requiresunbounded storage
  • Can provideresult as data stream
  • Can output after 2 min,without seeing end
query 2 join
Query 2 (join)

Pair upcallersand callees

SELECT O.caller, I.callee

FROM Outgoing O, Incoming I

WHERE O.call_ID = I.call_ID

  • Can still provideresult as data stream
  • Requiresunbounded temporary storage …
  • … unless streams arenear-synchronized
query 3 group by aggregation
Query 3 (group-by aggregation)

Total connection timefor each caller

SELECT O1.caller, sum(O2.time – O1.time)

FROM Outgoing O1, Outgoing O2

WHERE (O1.call_ID = O2.call_ID

AND O1.event = start

AND O2.event = end)

GROUP BY O1.caller

  • Cannot provide result in (append-only) stream.

Alternatives:

    • Output stream with updates
    • Provide current value on demand
    • Keep answer in memory
conclusions
Conclusions
  • Conventional DBMS technology is inadequate
  • We need reconsider all aspects of data management and processing in presence of data streams