1 / 32

Mining Data Streams with Periodically changing Distributions Yingying Tao, Tamer Ozsu CIKM’09

Mining Data Streams with Periodically changing Distributions Yingying Tao, Tamer Ozsu CIKM’09. Preview. Introduction Challenge Method DMM framework Distance Function Selection Experiments Conclusion. Introduction. Mining stream for knowledge mining such as Clustering

benito
Download Presentation

Mining Data Streams with Periodically changing Distributions Yingying Tao, Tamer Ozsu CIKM’09

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Mining Data Streams with Periodically changing DistributionsYingying Tao, Tamer Ozsu CIKM’09

  2. Preview • Introduction • Challenge • Method • DMM framework • Distance Function Selection • Experiments • Conclusion

  3. Introduction • Mining stream for knowledge mining such as • Clustering • Classification • Frequent patterns discovery , has become important • Important characteristic of unbounded data stream is that the underlying distributions can show important changes over time, leading to dynamic data streams.

  4. Challenge • The problem of mining dynamic data streams • Balance accuracy with efficiency – highly accurate mining techniques are generally computationally expensive • Some question to ask: for dynamic data streams • Is the distribution changes entirely random and unpredictable • Is it possible for the distribution changes to follow certain patterns?

  5. Method • Propose a method for mining dynamic data streams • Where important observed distributions patterns are stored • And compared new detected changes with these patterns

  6. Method • Two streams with the same distribution • Mining results such as • List of all frequent items / itemset for frequent patterns discovery • Set of clusters / classes for clustering and classification should be the same • If distribution change is detected and a match is found • Possible to skip the re-mining process • Directly output the mining results for the archived distribution • This is called the match-and-reuse strategy

  7. Method • Issues to be resolved • Pattern selection • Selecting and storing important distributions that have a high probability of occurring in the future • Pattern representation • Storing each pattern succinctly • Matching • Efficient procedure for rapid data streams with high accuracy

  8. DMM framework • DMM framework stands for Detect, Match and Mine • Consist of four sequences • Choosing representative set • Change detection • Pattern matching • Stream mining • All processes are independent

  9. DMM framework • Window model • Generate reference window (choosing representative set) • Change detection • Distribution matching (Pattern matching) • Choosing important distribution

  10. Window model • Two windows on Stream S • Time-based • Defines the time intervals • Denoted by Wt • Called observation window • Implemented as a tumbling window • Moves forward at each clock tick

  11. Window model cont’s • Count-based • Contains a sub stream with fixed number of elements • Denoted by Wr • Called reference window • The size of the reference window (|Wr|) and time intervals of Wt are predefined values

  12. DMM framework • Window model • Generate reference window (choosing representative set) • Change detection • Distribution matching (Pattern matching) • Choosing important distribution

  13. Generate reference window (choosing representative set) • Wrstores a set of data elements that represents a current distribution of S • The size needs to be small due to memory limitations • Inaccurate results if we use a small data set to represent a distribution if the distribution is complicated • Due to this problem, use a dynamic reference window (Wr)

  14. Generate reference window (choosing representative set) • Merge and select process • Dynamic reference window (Wr) • Merge Wtand Wrto get a larger substream |Wr|+|Wt| • Select |Wr| elements from the merged window (Wr +Wt) and replace the stream in Wr by the new set • Merge and select process is triggered every time Wt tumbles

  15. Generate reference window (choosing representative set)

  16. Generate reference window (choosing representative set) • Selecting representative set: Two-step sampling approach • First-step sampling approach • Estimate the density function of Wr + Wt • K= kernel function • h= smoothing parameter (bandwidth) • si= data element in Wr + Wt

  17. Generate reference window (choosing representative set) • Selecting representative set: Two-step sampling approach • K is set to (Standard Gaussian function mean = 0 variance = 1) • Then the density function • h=value between 0 or 1

  18. Generate reference window (choosing representative set) • Selecting representative set: Two-step sampling approach • With the density function we are able to estimate the “shape” of the current distribution • X-axis is the = value of the data s (v(s)) in Wr +Wt • Y-axis is the probability (p(v)) for all data values

  19. Generate reference window (choosing representative set) • Selecting representative set: Two-step sampling approach • Second-step sampling approach

  20. Generate reference window (choosing representative set) • Selecting representative set: Two-step sampling approach • Second-step sampling approach • First calculate the start and end values for each partition

  21. DMM framework • Window model • Generate reference window (choosing representative set) • Change detection • Distribution matching (Pattern matching) • Choosing important distribution

  22. Change detection • Online change detection technique that is not restricted to specific stream processing application • Wt tumbles, the change detection procedure is triggered • Compare the distributions of substreams in both Wr and Wtwindows • If the distance is greater than the predefined maximum matching distance, then a distribution change is flagged

  23. DMM framework • Window model • Generate reference window (choosing representative set) • Change detection • Distribution matching (Pattern matching) • Choosing important distribution

  24. Distribution matching (Pattern matching) • We use the appropriate distance measure to check their similarity • If a match is found, then the persevered mining results are outputted • The maximum predefined maximum matching is important • Smaller implies a higher accuracy • Larger increases the possibility of a new distribution to match a pattern in the preserved set

  25. DMM framework • Window model • Generate reference window (choosing representative set) • Change detection • Distribution matching (Pattern matching) • Choosing important distribution

  26. Choosing important distribution • Use heuristic rules • Distribution that have occurred in the stream for more times are more important • The longer a distribution lasts in the streams lifespan the more important it is • Distribution that has mining results with higher accuracy is more important that a distributions with less accurate mining results

  27. Distance Function Selection • Dynamic Time Wrapping (DTW), Longest Common Subsequence (LCSS), Edit Distance on Real Sequence (EDR) and Relativized Discrepancy (RD) • A proper distance function that can be used with DMM • Efficient , with the ability to stretching

  28. Experiments • Change detection • Kernel density approach (KD) • Distance function-based approach(DF)

  29. Experiments • Distribution matching evaluation • Data from Tropical Atmosphere Ocean • Sea surface temperatures. • 12 218 streams each with a length of 962

  30. Experiments • Efficiency with and without DMM • Adopt a popular decision tree-based clustering technique VFDT to cluster the temperatures • Best decision tree generators for dynamic data streams • Time is reduced by 31.3%

  31. Conclusion • Introduced a DMM framework to mine dynamic data streams • Window model • Generate reference window (choosing representative set) • Change detection • Distribution matching (Pattern matching) • Choosing important distribution • Experiments that showing DMM performs better

  32. Thank you for your attention

More Related