- By
**anika** - Follow User

- 228 Views
- Uploaded on

Download Presentation
## PowerPoint Slideshow about 'AMCS/CS 340 : Data Mining' - anika

**An Image/Link below is provided (as is) to download presentation**

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript

Mining Data StreamsAMCS/CS 340 : Data Mining

Xiangliang Zhang

King Abdullah University of Science and Technology

Outline

- Introduction of Data Streams
- Synopsis/sketch maintenance
- Sampling
- Sliding window
- Counting Distinct Elements
- Frequent pattern mining
- Stream Clustering
- Stream Classification
- Change and novelty detection

2

Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

A large number of applications generate data streams

– Telecommunication (call records)

– System management (network events)

– Surveillance (sensor network, audio/video)

– Financial market (stock exchange)

– Day to day business (credit card, ATM transactions, etc)

Tasks: Real time query answering, statistics maintenance, and pattern discovery on data streams

Motivations3

Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

Characteristics

High volume (possibly infinite) of continuousdata

Data arrive at a rapid rate

Data distribution changes on the fly

The system cannot store the entire stream (only the summary of the data seen thus far)

calculations about the stream should be done in a limited amount of (secondary) memory

Data streams — continuous, ordered, changing, fast, huge amount

Characteristics of Data Streams4

Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

IP session data (collected using Cisco NetFlow)

AT&T collects 100 GBs of NetFlow data each day!

Example: Network Management ApplicationSource Destination DurationBytes Protocol

10.1.0.2 16.2.3.7 12 20K http

18.6.7.1 12.4.0.3 16 24K http

13.9.4.3 11.6.8.2 15 20K http

15.2.2.9 17.1.2.1 19 40K http

12.4.3.8 14.8.7.4 26 58K http

10.5.1.3 13.0.0.1 27 100K ftp

11.1.0.6 10.3.4.5 32 300K ftp

19.7.1.2 16.5.5.8 18 80K ftp

Network Operations

Center

Measurements

Alarms

5

Network

Data stream processing in network management

Monitor link bandwidth usage, estimate traffic demands

How many bytes were sent between a pair of IP addresses?

What fraction network IP addresses are active?

List the top 100 IP addresses in terms of traffic

Quickly detect faults, congestion and isolate root cause

List all sessions that transmitted more than 1000 bytes

Identify all sessions whose duration was more than twice the normal

List all IP addresses that have witnessed a sudden spike in traffic

Load balancing, improve utilization of network resources

Example: Network Management Application6

Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

Mining query streams

Google wants to know what queries are more frequent today than yesterday.

Mining click streams

Yahoo wants to know which of its pages are getting an unusual number of hits in the past hour.

Mining social network news feeds

Look for trending topics on Twitter, Facebook

Mining call records

Summarize telephone call records into customer bills.

More Applications7

Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

Stream processing requirements

- Single pass: Each record is examined at most once
- Bounded storage: Limited Memory (M) for storing synopsis
- Real-time: Per record processing time (to maintain synopsis) must be low

Queries / Statistics /

Classification / Clustering

Processor

. . . 1, 5, 2, 7, 0, 9, 3

. . . a, r, v, t, y, h, b

. . . 0, 0, 1, 0, 1, 1, 0

time

Streams Entering

Output

Limited

Storage

8

Outline

- Introduction of Data Streams
- Synopsis/sketch maintenance
- Sampling
- Sliding window
- Counting Distinct Elements
- Frequent pattern mining
- Stream Clustering
- Stream Classification
- Change and novelty detection

9

Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

Can not store the entire stream ?

store a sample

Two different problems:

Sample a fixed proportion of elements in the stream (say 1 in 10)

Maintain a random sample of fixed size over a potentially infinite stream

Sampling from a data stream10

Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

A useful model of stream processing is that queries are about a window of length N--- the N most recent elements received.

q w e r t y u i o p a s d f g h j k l z x c v b n m

q w e r t y u i o p a s d f g h j k l z x c v b n m

q w e r t y u i o p a s d f g h j k l z x c v b n m

q w e r t y u i o p a s d f g h j k l z x c v b n m

Sliding Windows- Maintaining statistics

– Count/Sum of non-zero elements

– Variance

Past Future

11

Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

Problem: a data stream consists of elements chosen from a set of size n. Maintain a count of the number of distinct elements seen so far.

Obvious approach: maintain the set of elements seen.

Application:

How many different Web pages does each customer request in a week?

How many different words are found among the Web pages being crawled at a site?

Unusually low or high numbers could indicate artificial pages (spam made for influence search engine rankings?).

Counting Distinct Elements12

Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

Real Problem: what if we do not have space to store the complete set?

Estimate the count in an unbiased way.

Accept that the count may be in error, but limit the probability that the error is large.

Flajolet-Martin Approach [FM85]

Using Small Storage13

Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

Frequent Pattern Mining

14

- Frequent patterns: patterns (set of items, sequence, etc.) that occur frequently in a database [AIS93]
- Frequent pattern mining: finding regularities in data

– What products were often purchased together?

– What are the subsequent purchases after buying a PC?

– What kinds of DNA are sensitive to this new drug?

– Can we classify web documents based on key-word combinations?

- Apriori Algorithm

Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

Frequent Pattern in Data Streams – Challenges

15

- Maintaining exact counts for all (frequent) itemsetsneeds multiple scans of the stream

– Maintain approximation of counts

- Finding the exact set of frequent itemsetsfrom data streams cannot be online

– Have to scan data streams multiple times

– Space overhead

– Finding approximation of set of frequent itemsets

Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

Approximate answers are often sufficient (e.g., trend/pattern analysis)

Example: a router is interested in all flows:

whose frequency is at least σ (e.g., 10%)of the entire traffic stream seen so far

and feels that 1/10 of σ (ε= 0.1*σ) error is comfortable

How to mine frequent patterns with good approximation?

Lossy Counting Algorithm (Manku & Motwani, VLDB’02)

Major ideas: not tracing items until it becomes frequent

Advantage: guaranteed error bound

Disadvantage: keep a large set of traces

Mining Approximate Frequent Patterns16

Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

Lossy Counting – Ideas

17

- Divide the stream into buckets, maintain a global count of buckets seen so far
- For any item, if its count is less than the global count of buckets, then its count does not need to be maintained

– How to divide buckets so that the possible errors are bounded?

– How to guarantee the number of entries needed to be recorded is also bounded?

Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

Bucket 2

Bucket 3

Lossy Counting for Frequent ItemsDivide Stream into ‘Buckets’ (bucket size is 1/ ε = 10)

18

Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

(summary)

+

After a bucket, decrease all counters by 1

First Bucket of Stream19

Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

After a bucket, decrease all counters by 1

Next Bucket of Stream20

Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

Given: (1) support threshold: σ, (2) error threshold: ε, and (3) stream length N

Output: items with frequency counts exceeding (σ–ε) N

How much do we undercount?

If stream length seen so far = N

and bucket-size = 1/ε

then frequency count error #buckets = εN

Approximation guarantee

No false negatives (freq but not reported)

False positives have true frequency count at least (σ–ε)N

Frequency count underestimated by at most εN

Approximation Guarantee21

Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

Outline

- Introduction of Data Streams
- Synopsis/sketch maintenance
- Sampling
- Sliding window
- Counting Distinct Elements
- Frequent pattern mining
- Stream Clustering
- Stream Classification
- Change and novelty detection

22

Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

CluStream: Clustering On-line Streams

- Online micro-cluster maintenance
- Initial creation of q micro-clusters
- q is usually significantly larger than the number of natural clusters
- Online incremental update of micro-clusters
- If new point is within max-boundary, insert into the micro-cluster
- Otherwise, create a new cluster
- May delete obsolete micro-cluster or merge two closest ones
- Offline query-based macro-clustering
- Based on a user-specified time-horizon h and the number of macro-clusters K, compute macro-clusters using clustering algorithm, e.g. k-means, DbScan.

25

Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

- • Clustering data stream with one scan and limited main memory
- – Clustering in a sliding window
- – Clustering the whole stream (online)
- • How to handle evolving data?
- – Online summarization and offline analysis
- – Change detection
- • Applications and extensions
- – Outlier detection, nearest neighbor search, reverse nearest neighbor queries, …

33

Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

Outline

- Introduction of Data Streams
- Synopsis/sketch maintenance
- Sampling
- Sliding window
- Counting Distinct Elements
- Frequent pattern mining
- Stream Clustering
- Stream Classification
- Change and novelty detection

34

Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

Decision tree induction for stream data classification

VFDT (Very Fast Decision Tree) / CVFDT (Domingos, Hulten, Spencer, KDD00/KDD01)

Is decision-tree good for modeling fast changing data, e.g., stock market analysis?

Other stream classification methods

Instead of decision-trees, consider other models

Naïve Bayesian

Ensemble(Wang, Fan, Yu, Han. KDD’03)

K-nearest neighbors (Aggarwal, Han, Wang, Yu. KDD’04)

incremental updating, dynamic maintenance, and model construction

Classification for Dynamic Data Streams35

What are the Challenges?

36

- Data Volume

– impossible to mine the entire data at one time

– can only afford constant memory per data sample

- Concept Drifts

– previously learned models are invalid

- Cost of Learning

– model updates can be costly

– can only afford constant time per data sample

The Decision Tree Classifier

37

- Learning (Training) :

– Input: a data set of (X, y), where X is a vector, y a class label

– Output: a model (decision tree)

- Testing:

– Input: a test sample (x, ?)

– Output: a class label prediction for x

The Decision Tree Classifier

38

- A divide-and-conquer approach

– Simple algorithm, intuitive model

- Compute information gain for data in each node

– Super-linear complexity

- Typically a decision tree grows one level for each scan of data

– Multiple scans are required

- The data structure is not ‘stable’

– Subtle changes of data can cause global changes in the data structure

Challenge for streams

39

- Task:

– Given enough samples, can we build a tree in constant time that is nearly identical to the tree a batch learner (C4.5, Sprint, etc.) would build?

- Intuition:

– With increasing # of samples, the # of possible decision trees becomes smaller

- Forget about concept drifts for now.

Decision-Tree Induction with Data Streams

Packets > 10

Data Stream

yes

no

- At each node, we shall accumulate enough samples (n) before we make a split
- Problem: How many examples are necessary? n=?

Protocol = http

Packets > 10

Data Stream

yes

no

Bytes > 60K

Protocol = http

yes

Protocol = ftp

Ack. From Gehrke’s SIGMOD tutorial slides

Hoeffding Bound

41

- Given

– r : real valued random variable

– n : # independent observations of r

– R : range of r

- The difference between r and ravgis bounded by ε, with probability 1-δ,

P( |μr- ravg| ≥ ε) <1-δand

Hoeffding Bound

42

P( |μr - ravg| ≥ ε) < 1-δ and

- Properties:

– Hoeffding bound is independent of data distribution

– Error ε decreases when n (# of samples) increases

- Hoeffding Tree,
- based on Hoeffding Bound principle
- At each node, we shall accumulate enough samples (n) before we make a split
- When n is large enough, error ε decreases to a small value

Hoeffding Tree Input

S: sequence of examples

X: attributes

G( ): evaluation function, e.g., Gini gain

Hoeffding Tree Algorithm

for each example in S

retrieve G(Xa) and G(Xb) //two highest G(Xi)

if ( G(Xa) – G(Xb) > ε ) εis computed by using hoeffding bound

split on Xa

recursively go to next node

Hoeffding Tree Algorithm43

Hoeffding Tree: Pros and Cons

44

- Scales better than traditional DT algorithms

– Incremental

– Sub-linear with sampling

– Small memory requirement

- Cons:

– Only consider top 2 attributes

– Tie breaking takes time

– Grow a deep tree takes time

Outline

- Introduction of Data Streams
- Synopsis/sketch maintenance
- Sampling
- Sliding window
- Counting Distinct Elements
- Frequent pattern mining
- Stream Clustering
- Stream Classification
- Change and novelty detection

45

Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

Change detection

reference distribution

Kullback-Leibler distance can be used to measure the difference between two given distributions

[Dasu et al, 2006]

46

General idea: compare a reference distribution with a current window of events

Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

Change detection

reference distribution

Density statistic test can be used to test whether the newly observed data points S0 are sampled from the underlying distribution that produced the baseline data set S.

[Song et al, 2007]

47

General idea: compare a reference distribution with a current window of events

Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

Issue of reference window in change detection

- Based on the stationary reference data
- What if the underlying distribution is not stationary ?
- e.g. in network intrusion detection by monitoring network traffic, the distribution of reference data (usually normal data) is evolving over time

reference distribution

48

General idea: compare a reference distribution with a current window of events

Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

Stream data mining: A rich and on-going research field

Research in database community:

DSMS system architecture, continuous query processing, supporting mechanisms

Stream data mining

Powerful tools for finding general and unusual patterns

Effectiveness, efficiency and scalability: lots of open problems

Summary: Stream Data Mining49

Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

C. Aggarwal, J. Han, J. Wang, P. S. Yu. A Framework for Clustering Data Streams, VLDB'03

C. C. Aggarwal, J. Han, J. Wang and P. S. Yu. On-Demand Classification of Evolving Data Streams, KDD'04

C. Aggarwal, J. Han, J. Wang, and P. S. Yu. A Framework for Projected Clustering of High Dimensional Data Streams, VLDB'04

S. Babu and J. Widom. Continuous Queries over Data Streams. SIGMOD Record, 2001

B. Babcock, S. Babu, M. Datar, R. Motwani and J. Widom. Models and Issues in Data Stream Systems”, PODS'02. (Conference tutorial)

Y. Chen, G. Dong, J. Han, B. W. Wah, and J. Wang. "Multi-Dimensional Regression Analysis of Time-Series Data Streams, VLDB'02

P. Domingos and G. Hulten, “Mining high-speed data streams”, KDD'00

A. Dobra, M. N. Garofalakis, J. Gehrke, R. Rastogi. Processing Complex Aggregate Queries over Data Streams, SIGMOD’02

J. Gehrke, F. Korn, D. Srivastava. On computing correlated aggregates over continuous data streams. SIGMOD'01

C. Giannella, J. Han, J. Pei, X. Yan and P.S. Yu. Mining frequent patterns in data streams at multiple time granularities, Kargupta, et al. (eds.), Next Generation Data Mining’02

MayurDatar, Aristides Gionis, PiotrIndyk, Rajeev Motwani. Maintaining stream statistics over sliding windows. SODA 2002

References on Stream Data Mining (1)S. Guha, N. Mishra, R. Motwani, and L. O'Callaghan. Clustering Data Streams, FOCS'00

G. Hulten, L. Spencer and P. Domingos: Mining time-changing data streams. KDD 2001

S. Madden, M. Shah, J. Hellerstein, V. Raman, Continuously Adaptive Continuous Queries over Streams, SIGMOD02

G. Manku, R. Motwani. Approximate Frequency Counts over Data Streams, VLDB’02

A. Metwally, D. Agrawal, and A. El Abbadi. Efficient Computation of Frequent and Top-k Elements in Data Streams. ICDT'05

S. Muthukrishnan, Data streams: algorithms and applications, Proceedings of the fourteenth annual ACM-SIAM symposium on Discrete algorithms, 2003

R. Motwani and P. Raghavan, Randomized Algorithms, Cambridge Univ. Press, 1995

S. Viglas and J. Naughton, Rate-Based Query Optimization for Streaming Information Sources, SIGMOD’02

Y. Zhu and D. Shasha. StatStream: Statistical Monitoring of Thousands of Data Streams in Real Time, VLDB’02

H. Wang, W. Fan, P. S. Yu, and J. Han, Mining Concept-Drifting Data Streams using Ensemble Classifiers, KDD'03

References on Stream Data Mining (2)Some of the material is borrowed from lectures of

Jiawei Han and MichelineKamber

Minos Garofalakis, Johannes Gehrke and Rajeev Rastogi

Jure Leskovec and AnandRajaraman

Haixun Wang, Jian Pei, and Philip S. Yu

Acknowledgements52

Download Presentation

Connecting to Server..