
ADVANCED TOPICS IN DATA MINING CSE 8331 Spring 2010 Part I

Presentation Transcript


  1. ADVANCED TOPICS IN DATA MINING CSE 8331 Spring 2010 Part I. Margaret H. Dunham, Department of Computer Science and Engineering, Southern Methodist University

  2. Data Mining Outline • EMM • Stream Mining • Text Mining • Bioinformatics Mining

  3. EMM Overview • Time Varying Discrete First Order Markov Model • Nodes are clusters of real world states. • Learning continues during prediction phase. • Learning: • Transition probabilities between nodes • Node labels (centroid of cluster) • Nodes are added and removed as data arrives

  4. MM A first order Markov Chain is a finite or countably infinite sequence of events {E1, E2, …} over discrete time points, where Pij = P(Ej | Ei), and at any time the future behavior of the process is based solely on the current state. A Markov Model (MM) is a graph with m vertices or states, S, and directed arcs, A, such that: • S = {N1, N2, …, Nm}, and • A = {Lij | i = 1, 2, …, m; j = 1, 2, …, m}, and each arc, Lij = <Ni, Nj>, is labeled with a transition probability Pij = P(Nj | Ni).
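
A minimal sketch (not from the slides) of how the transition probabilities Pij of a first order MM can be estimated from an observed state sequence; the state labels and the example sequence are purely illustrative:

```python
from collections import defaultdict

def transition_probabilities(state_sequence):
    """Estimate Pij = P(Nj | Ni) from an observed sequence of states."""
    counts = defaultdict(lambda: defaultdict(int))
    for current, nxt in zip(state_sequence, state_sequence[1:]):
        counts[current][nxt] += 1            # count each transition Ni -> Nj
    probs = {}
    for i, row in counts.items():
        total = sum(row.values())            # total outgoing transitions from Ni
        probs[i] = {j: c / total for j, c in row.items()}
    return probs

# Hypothetical sequence over states N1..N3
print(transition_probabilities(["N1", "N2", "N1", "N3", "N1", "N2"]))
```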

  5. EMM Definition Extensible Markov Model (EMM): at any time t, EMM consists of an MC with a designated current node, Nn, and algorithms to modify it, where the algorithms include: • EMMCluster, which defines a technique for matching between input data at time t + 1 and existing states in the MC at time t. • EMMIncrement algorithm, which updates the MC at time t + 1 given the MC at time t and the clustering measure result at time t + 1. • EMMDecrement algorithm, which removes nodes from the EMM when needed.

  6. EMM Cluster • Find the closest node to the incoming event. • If none is "close", create a new node. • The label of a cluster is the centroid of its members. • O(n)
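
A hedged sketch of the EMMCluster matching step described above: an O(n) scan over the existing node centroids, with a distance threshold deciding whether the event is matched to a node or a new node must be created. The data structures, distance measure, and threshold are illustrative assumptions, not details from the slides:

```python
import math

def emm_cluster(centroids, event, threshold):
    """Return the label of the closest node, or None if no node is 'close'.

    centroids : dict mapping node label -> centroid vector
    event     : incoming data vector
    threshold : maximum distance at which an event is matched to a node
    """
    best_label, best_dist = None, float("inf")
    for label, c in centroids.items():        # O(n) scan over existing nodes
        d = math.dist(c, event)               # Euclidean distance to the centroid
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label if best_dist <= threshold else None
```

If None is returned, the caller creates a new node with the event as its initial centroid; otherwise the matched node's centroid is updated with the new member.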

  7. EMMSim • Find the closest node to the incoming event. • If none is "close", create a new node. • The label of a cluster is the centroid/medoid of its members. • Problem • Nearest Neighbor O(n) • BIRCH O(lg n) • Requires a second phase to recluster the initial clusters

  8. EMM Increment [Figure: step-by-step growth of the EMM as the input vectors <14,8,2,3,0,0,0>, <14,8,2,3,1,0,0>, <16,9,2,3,1,0,0>, <17,10,2,3,1,0,0>, <18,10,3,3,1,0,0>, <18,10,3,3,1,1,0> arrive; nodes N1, N2, N3 are created and the transition counts/probabilities (1/1, 1/3, 2/3, …) are updated after each event.]
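
A hedged sketch of the bookkeeping behind EMMIncrement: once EMMCluster has matched (or created) a node for the new event, the transition count from the current node is incremented and the current node is advanced. The data structures and node labels are illustrative assumptions, not code from the slides:

```python
def emm_increment(transition_counts, current_node, new_node):
    """Update the transition count matrix and return the new current node.

    transition_counts : dict mapping (from_node, to_node) -> count
    current_node      : node label at time t (None before the first event)
    new_node          : node label matched/created for the event at time t+1
    """
    if current_node is not None:
        key = (current_node, new_node)
        transition_counts[key] = transition_counts.get(key, 0) + 1
    return new_node

# Hypothetical usage: node assignments as they might come out of EMMCluster
counts = {}
current = None
for node in ["N1", "N1", "N2", "N1", "N3"]:
    current = emm_increment(counts, current, node)
print(counts)   # {('N1', 'N1'): 1, ('N1', 'N2'): 1, ('N2', 'N1'): 1, ('N1', 'N3'): 1}
```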

  9. EMM Forget [Figure: an EMM before and after a node is forgotten; the node and its incident transitions are removed, and the remaining transition counts/probabilities among nodes N1, N2, N3, N5, N6 change accordingly.]
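
A minimal sketch of the corresponding EMMDecrement ("forget") step, using the same illustrative data structures as the sketches above: deleting a node drops its centroid and every transition count that touches it:

```python
def emm_forget(centroids, transition_counts, node):
    """Remove a node and all transitions into or out of it.

    centroids         : dict mapping node label -> centroid vector
    transition_counts : dict mapping (from_node, to_node) -> count
    node              : label of the node to forget
    """
    centroids.pop(node, None)                         # drop the node's centroid
    for (i, j) in list(transition_counts):
        if i == node or j == node:
            del transition_counts[(i, j)]             # drop incident transitions
```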

  10. Data Mining Outline • EMM • Stream Mining • Data Stream Overview • Data Stream Modeling • Data Stream Clustering • TRAC-DS • Anomaly Detection • Text Mining • Bioinformatics Mining

  11. Motivation • A growing number of applications generate streams of data. • Computer network monitoring data • Call detail records in telecommunications (Cisco VoIP 2003) • Highway transportation traffic data (MnDot 2005) • Online web purchase log records (JCPenney 2003, Travelocity 2005) • Sensor network data (Ouse, Serwent 2002) • Stock exchange, transactions in retail chains, ATM operations in banks, credit card transactions. • Data mining techniques play a key role in the data models of a Data Stream Management System.

  12. Background Characteristics of data stream: • Data are raw • Records may arrive at a rapid rate • High volume (possibly infinite) of continuous data • Concept drifts: Data distribution changes on the fly • Multidimensional • Temporality Stream processing restrictions: • Data modeling (synopsis) • Single pass: Each record is examined at most once • Bounded storage: Limited memory for storing the synopsis • Real-time: Per record processing time must be low Haixun Wang, Jian Pei, Philip S. Yu, ICDE 2005; Keogh, ICDM’04

  13. From Sensors to Streams • Data captured and sent by a set of sensors is usually referred to as “stream data”. • A real-time sequence of encoded signals that contains the desired information; it is a continuous, ordered (implicitly by arrival time, or explicitly by timestamp or geographic coordinates) sequence of items. • Stream data is infinite - the data keeps coming.

  14. Suppose There Were MANY Sensors • Traditional line graphs would be very difficult to read • Requirements for new visualization technique: • High level summary of data • Handle multiple sensors at once • Continuous • Temporal • Spatial

  15. Spatiotemporal Environment • Events arriving in a stream • At any time, t, we can view the state of the problem as represented by a vector of n numeric values: Vt = <S1t, S2t, ..., Snt>

  16. Data Stream Management Systems (DSMS) • Software to facilitate querying and managing stream data. • Retrieve the most recent information from the stream • Data aggregation facilitates merging together multiple streams • Modeling stream data to “summarize” stream • Visualization needed to observe in real-time the spatial and temporal patterns and trends hidden in the data.

  17. DSMS Problems • Stream management development is in a state similar to that of databases prior to the 1970s • Each system/researcher looks at a specific application or system • No standards concerning functionality • No standard query language • Unreasonable to expect that end users will access raw data, data in the DSMS, or even data at a summarized view • Domain experts need to “see” a higher level of the data

  18. Data Stream Modeling • Single pass: Each record is examined at most once • Bounded storage: Limited memory for storing the synopsis • Real-time: Per record processing time must be low • Summarization (synopsis) of data • Use the data, NOT a SAMPLE • Temporal and spatial • Dynamic • Continuous (infinite stream) • Learn • Forget • Sublinear growth rate - Clustering

  19. Problem with Markov Chains • The required structure of the MC may not be certain at model construction time. • As the real world being modeled by the MC changes, so should the structure of the MC. • Not scalable – grows linearly with the number of events. • Markov Property • Our solution: • Extensible Markov Model (EMM) • Cluster real world events • Allow the Markov chain to grow and shrink dynamically

  20. EMM Sublinear Growth Rate [Figure: EMM growth rate on Minnesota Department of Transportation (MnDot) traffic data.]

  21. Traditional Clustering

  22. TRAC-DS

  23. Motivation • Temporal Ordering is a major feature of stream data. • Many stream applications depend on this ordering • Prediction of future values • Anomaly (rare event) detection • Concept drift

  24. Stream Clustering Requirements • Dynamic updating of the clusters • Identify outliers • Barbará: • Compactness • Fast, incremental processing

  25. Stream Clustering Algorithms • LOCALSEARCH • Partitions stream into segments • Clusters each segment individually by solving the k-medians problem • Iteratively reclusters the resulting centers • CluStream • Micro-clusters represented by summary statistics. • Micro-clusters are handled online • Micro-clusters merged offline • MONIC • Evolution of clusters over time • Cluster transitions over time

  26. TRAC-DS NOTE • TRAC-DS is not: • Another stream clustering algorithm • TRAC-DS is: • A new way of looking at clustering • Built on top of an existing clustering algorithm • TRAC-DS may be used with any stream clustering algorithm

  27. TRAC-DS Overview

  28. Data Stream Clustering • At each point in time a data stream clustering ζ is a partitioning of D', the data seen thus far. • Instead of the whole partitions C1, C2, ..., Ck, only the synopses Cc1, Cc2, ..., Cck are available, and k is allowed to change over time. • The summaries Cci, with i = 1, 2, ..., k, typically contain information about the size, distribution and location of the data points in Ci.

  29. TRAC-DS Definition Given a data stream clustering ζ, a temporal relationship among clusters (TRAC-DS) overlays the data stream clustering ζ with an EMM M, in such a way that the following are satisfied: (1) There is a one-to-one correspondence between the clusters in ζ and the states S in M. (2) A transition aij in the EMM M represents the probability that, given a data point in cluster i, the next data point in the data stream will belong to cluster j, with i, j = 1, 2, ..., k. (3) The EMM M is created online together with the data stream clustering.

  30. Clustering Operations A clustering operation is a function q : ζ × x → ζ which is used by the data stream clustering algorithm to update the clustering ζ given some additional information x, which is either a new data point or other information (e.g., the number of the cluster to be deleted in order to simplify the clustering).

  31. TRAC-DS Operations • A TRAC-DS operation is a function r : M × sc × y → M × sc that updates the temporal relationship among clusters represented by the EMM M with states S given a current state sc ∈ S and additional information y and returns an updated EMM and possibly a new current state. • In order to be able to dynamically update the EMM M we need to store a transition count matrix C. The count cij in C contains the number of times we observed a new point being assigned by the clustering algorithm to cluster i followed by a point being assigned to cluster j.
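
A hedged sketch of how the count matrix C described above might be maintained when a new point arrives: the clustering operation (the q side) and the corresponding TRAC-DS operation (the r side) are applied together. The `clusterer.assign` interface and all names are illustrative assumptions, not an API from the paper:

```python
from collections import defaultdict

def assign_point(clusterer, counts, current_state, x):
    """Pair the clustering update (q) with the EMM update (r) for one point x.

    clusterer     : any stream clustering algorithm exposing assign(x) -> cluster id
                    (a hypothetical interface)
    counts        : defaultdict(int) mapping (i, j) -> observed i -> j transition count
    current_state : state for the cluster of the previous point, or None at the start
    """
    j = clusterer.assign(x)                 # q_assign point: clustering side
    if current_state is not None:
        counts[(current_state, j)] += 1     # r_assign point: EMM side, update C
    return j                                # becomes the new current state sc
```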

  32. Stream Clustering Operations * • qassign point(ζ,x): Assigns the new data point x to an existing cluster. • qnew cluster(ζ,x): Create a new cluster. • qremove cluster(ζ,x): Removes a cluster. Here x is the cluster, i, to be removed. In this case the associated summary Cci is removed from ζ and k is decremented by one. • qmerge clusters(ζ,x): Merges two clusters. • qfade clusters(ζ,x): Fades the cluster structure. • qsplit clusters(ζ,x): Splits a cluster. * Inspired by MONIC

  33. TRAC-DS Operations • rassign point(M,sc,y): Assigns the new data point to the state representing an existing cluster. • rnew cluster(M,sc,y): Creates a state for a new cluster. • rremove cluster(M,sc,y): Removes a state. • rmerge clusters(M,sc,y): Merges two states. • rfade clusters(M,sc,y): Fades the transition probabilities using an exponential decay f(t) = 2^(−λt). • rsplit clusters(M,sc,y): Splits states. • The additional information y is supplied by the corresponding clustering operations.
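
A small sketch of the fading idea behind rfade clusters: every stored transition count is scaled by the decay factor 2^(−λt) before newer observations are added. The data structure and the choice of λ and t are assumptions for illustration:

```python
def fade_counts(transition_counts, lam, t=1.0):
    """Apply the exponential decay f(t) = 2**(-lam * t) to all transition counts."""
    decay = 2 ** (-lam * t)
    for key in transition_counts:
        transition_counts[key] *= decay     # older transitions carry less weight
```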

  34. TRAC-DS Example

  35. TRAC-DS Advantages • Dynamic • Flexible: • Use any Clustering Algorithm • Supports any clustering operations • Scalable • Merges Clustering & Markov Modeling

  36. What is an Anomaly? • An event that is unusual • An event that doesn’t occur frequently • A predefined event • What is unusual? • What is a deviation?

  37. What is an Anomaly in Stream Data? • Rare - Anomalous - Surprising • Out of the ordinary • Not outlier detection • No knowledge of the data distribution • Data is not static • Must take temporal and spatial values into account • May be interested in a sequence of events • Ex: Snow in upstate New York is not an anomaly • Snow in upstate New York in June is rare • Rare events may change over time

  38. Statistical View of Anomaly • Outlier • A data item that is outside the normal distribution of the data • Identify by box plot [Image from Data Mining: Introductory and Advanced Topics, Prentice Hall, 2002.]
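
For reference, a minimal IQR-based outlier check of the kind a box plot visualizes; the conventional 1.5 × IQR fence is used here as an assumption, not something stated on the slide:

```python
import statistics

def iqr_outliers(values):
    """Return the values falling outside the 1.5 * IQR box-plot fences."""
    q1, _, q3 = statistics.quantiles(values, n=4)   # first and third quartiles
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lo or v > hi]
```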

  39. Statistical View of Anomaly • Identify by looking at the distribution • THIS DOES NOT WORK with stream data [Image from www.wikipedia.org, normal distribution.]

  40. Data Mining View of Anomaly • Classification Problem • Build classifier from training data • Problem is that training data shows what is NOT an anomaly • Thus an anomaly is anything that is not viewed as normal by the classification technique • MUST build dynamic classifier • Identify anomalous behavior • Signatures of what anomalous behavior looks like • Input data is identified as anomaly if it is similar enough to one of these signatures • Mixed – Classification and Signature

  41. EMM Advantages • Dynamic • Adaptable • Use of clustering • Learns rare event • Scalable: • Growth of EMM is not linear on size of data. • Hierarchical feature of EMM • Creation/evaluation quasi-real time • Distributed / Hierarchical extensions

  42. Growth of EMM [Figure: growth of the EMM on Serwent sensor data.]

  43. TRAC-DS Approach to Detect Anomalies • By learning what is normal, the model can predict what is not • Normal is based on likelihood of occurrence • Use TRAC-DS to build clusters and the behavior between clusters • We view a rare event as: • An unusual event • A transition between event states which does not occur frequently • Continue learning

  44. Determining Rare • Occurrence Frequency (OFi) of an EMM state Si is the normalized count of that state: • Normalized Transition Probability (NTPmn), from one state, Sm, to another, Sn, is a normalized transition count:
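
The two formulas appear only as images in the original slides; the sketch below is a plausible reconstruction consistent with the bullet text (state counts normalized by the total event count, transition counts normalized by the outgoing count of the source state), offered as an assumption rather than the authors' exact definitions:

```python
def occurrence_frequency(state_counts, state):
    """OF_i: count of state S_i divided by the total count over all states."""
    return state_counts.get(state, 0) / sum(state_counts.values())

def normalized_transition_probability(transition_counts, m, n):
    """NTP_mn: count of S_m -> S_n transitions divided by all transitions out of S_m."""
    out_of_m = sum(c for (i, _), c in transition_counts.items() if i == m)
    return transition_counts.get((m, n), 0) / out_of_m if out_of_m else 0.0
```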

  45. EMMRare • The EMMRare algorithm indicates whether the current input event is rare. Using a threshold occurrence percentage, the input event is determined to be rare if either of the following occurs: • The frequency of the node at time t+1 is below this threshold • The updated transition probability of the MC transition from the node at time t to the node at time t+1 is below the threshold
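
A hedged sketch of that test, re-implementing the OF and NTP computations inline so it stands alone; the threshold value and data structures are illustrative assumptions:

```python
def emm_rare(state_counts, transition_counts, prev_state, new_state, threshold):
    """Return True if the current input event should be flagged as rare."""
    total = sum(state_counts.values())
    of = state_counts.get(new_state, 0) / total           # frequency of the node at t+1
    out_of_prev = sum(c for (i, _), c in transition_counts.items() if i == prev_state)
    ntp = (transition_counts.get((prev_state, new_state), 0) / out_of_prev
           if out_of_prev else 0.0)                        # transition probability t -> t+1
    return of < threshold or ntp < threshold
```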
