
ADVANCED TOPICS IN DATA MINING CSE 8331 Spring 2010 Part I

Presentation Transcript


  1. ADVANCED TOPICS IN DATA MINING CSE 8331 Spring 2010 Part I. Margaret H. Dunham, Department of Computer Science and Engineering, Southern Methodist University

  2. Data Mining Outline • EMM • Stream Mining • Text Mining • Bioinformatics Mining

  3. EMM Overview • Time Varying Discrete First Order Markov Model • Nodes are clusters of real world states. • Learning continues during prediction phase. • Learning: • Transition probabilities between nodes • Node labels (centroid of cluster) • Nodes are added and removed as data arrives

  4. MM A first order Markov Chain is a finite or countably infinite sequence of events {E1, E2, …} over discrete time points, where Pij = P(Ej | Ei), and at any time the future behavior of the process is based solely on the current state. A Markov Model (MM) is a graph with m vertices or states, S, and directed arcs, A, such that: • S = {N1, N2, …, Nm}, and • A = {Lij | i = 1, 2, …, m; j = 1, 2, …, m}, and each arc, Lij = <Ni, Nj>, is labeled with a transition probability Pij = P(Nj | Ni).
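
A minimal sketch (not from the slides) of how the transition probabilities Pij of a first order MM can be estimated from an observed state sequence; the state labels and the example sequence are purely illustrative:

```python
from collections import defaultdict

def transition_probabilities(state_sequence):
    """Estimate Pij = P(Nj | Ni) from an observed sequence of states."""
    counts = defaultdict(lambda: defaultdict(int))
    for current, nxt in zip(state_sequence, state_sequence[1:]):
        counts[current][nxt] += 1            # count each transition Ni -> Nj
    probs = {}
    for i, row in counts.items():
        total = sum(row.values())            # total outgoing transitions from Ni
        probs[i] = {j: c / total for j, c in row.items()}
    return probs

# Hypothetical sequence over states N1..N3
print(transition_probabilities(["N1", "N2", "N1", "N3", "N1", "N2"]))
```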

  5. EMM Definition Extensible Markov Model (EMM): at any time t, EMM consists of an MC with a designated current node, Nn, and algorithms to modify it, where the algorithms include: • EMMCluster, which defines a technique for matching between input data at time t + 1 and existing states in the MC at time t. • EMMIncrement algorithm, which updates the MC at time t + 1 given the MC at time t and the clustering measure result at time t + 1. • EMMDecrement algorithm, which removes nodes from the EMM when needed.

  6. EMM Cluster • Find the closest node to the incoming event. • If none is "close", create a new node. • The label of a cluster is the centroid of its members. • O(n)
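
A hedged sketch of the EMMCluster matching step described above: an O(n) scan over the existing node centroids, with a distance threshold deciding whether the event is matched to a node or a new node must be created. The data structures, distance measure, and threshold are illustrative assumptions, not details from the slides:

```python
import math

def emm_cluster(centroids, event, threshold):
    """Return the label of the closest node, or None if no node is 'close'.

    centroids : dict mapping node label -> centroid vector
    event     : incoming data vector
    threshold : maximum distance at which an event is matched to a node
    """
    best_label, best_dist = None, float("inf")
    for label, c in centroids.items():        # O(n) scan over existing nodes
        d = math.dist(c, event)               # Euclidean distance to the centroid
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label if best_dist <= threshold else None
```

If None is returned, the caller creates a new node with the event as its initial centroid; otherwise the matched node's centroid is updated with the new member.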

  7. EMMSim • Find the closest node to the incoming event. • If none is "close", create a new node. • The label of a cluster is the centroid/medoid of its members. • Problem • Nearest Neighbor O(n) • BIRCH O(lg n) • Requires a second phase to recluster the initial clusters

  8. EMM Increment [Figure: step-by-step growth of the EMM as the input vectors <14,8,2,3,0,0,0>, <14,8,2,3,1,0,0>, <16,9,2,3,1,0,0>, <17,10,2,3,1,0,0>, <18,10,3,3,1,0,0>, <18,10,3,3,1,1,0> arrive; nodes N1, N2, N3 are created and the transition counts/probabilities (1/1, 1/3, 2/3, …) are updated after each event.]
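
A hedged sketch of the bookkeeping behind EMMIncrement: once EMMCluster has matched (or created) a node for the new event, the transition count from the current node is incremented and the current node is advanced. The data structures and node labels are illustrative assumptions, not code from the slides:

```python
def emm_increment(transition_counts, current_node, new_node):
    """Update the transition count matrix and return the new current node.

    transition_counts : dict mapping (from_node, to_node) -> count
    current_node      : node label at time t (None before the first event)
    new_node          : node label matched/created for the event at time t+1
    """
    if current_node is not None:
        key = (current_node, new_node)
        transition_counts[key] = transition_counts.get(key, 0) + 1
    return new_node

# Hypothetical usage: node assignments as they might come out of EMMCluster
counts = {}
current = None
for node in ["N1", "N1", "N2", "N1", "N3"]:
    current = emm_increment(counts, current, node)
print(counts)   # {('N1', 'N1'): 1, ('N1', 'N2'): 1, ('N2', 'N1'): 1, ('N1', 'N3'): 1}
```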

  9. EMM Forget [Figure: an EMM before and after a node is forgotten; the node and its incident transitions are removed, and the remaining transition counts/probabilities among nodes N1, N2, N3, N5, N6 change accordingly.]
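
A minimal sketch of the corresponding EMMDecrement ("forget") step, using the same illustrative data structures as the sketches above: deleting a node drops its centroid and every transition count that touches it:

```python
def emm_forget(centroids, transition_counts, node):
    """Remove a node and all transitions into or out of it.

    centroids         : dict mapping node label -> centroid vector
    transition_counts : dict mapping (from_node, to_node) -> count
    node              : label of the node to forget
    """
    centroids.pop(node, None)                         # drop the node's centroid
    for (i, j) in list(transition_counts):
        if i == node or j == node:
            del transition_counts[(i, j)]             # drop incident transitions
```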

  10. Data Mining Outline • EMM • Stream Mining • Data Stream Overview • Data Stream Modeling • Data Stream Clustering • TRAC-DS • Anomaly Detection • Text Mining • Bioinformatics Mining

  11. Motivation • A growing number of applications generate streams of data. • Computer network monitoring data • Call detail records in telecommunications (Cisco VoIP 2003) • Highway transportation traffic data (MnDot 2005) • Online web purchase log records (JCPenney 2003, Travelocity 2005) • Sensor network data (Ouse, Serwent 2002) • Stock exchange, transactions in retail chains, ATM operations in banks, credit card transactions. • Data mining techniques play a key role in the data models of a Data Stream Management System.

  12. Background Characteristics of data stream: • Data are raw • Records may arrive at a rapid rate • High volume (possibly infinite) of continuous data • Concept drifts: Data distribution changes on the fly • Multidimensional • Temporality Stream processing restrictions: • Data modeling (synopsis) • Single pass: Each record is examined at most once • Bounded storage: Limited memory for storing the synopsis • Real-time: Per record processing time must be low Haixun Wang, Jian Pei, Philip S. Yu, ICDE 2005; Keogh, ICDM’04

  13. From Sensors to Streams • Data captured and sent by a set of sensors is usually referred to as “stream data”. • A real-time sequence of encoded signals that contains the desired information; it is a continuous, ordered (implicitly by arrival time, or explicitly by timestamp or geographic coordinates) sequence of items. • Stream data is infinite - the data keeps coming.

  14. Suppose There Were MANY Sensors • Traditional line graphs would be very difficult to read • Requirements for new visualization technique: • High level summary of data • Handle multiple sensors at once • Continuous • Temporal • Spatial

  15. Spatiotemporal Environment • Events arriving in a stream • At any time, t, we can view the state of the problem as represented by a vector of n numeric values: Vt = <S1t, S2t, ..., Snt>

  16. Data Stream Management Systems (DSMS) • Software to facilitate querying and managing stream data. • Retrieve the most recent information from the stream • Data aggregation facilitates merging together multiple streams • Modeling stream data to “summarize” stream • Visualization needed to observe in real-time the spatial and temporal patterns and trends hidden in the data.

  17. DSMS Problems • Stream management development is in a state similar to that of databases prior to the 1970s • Each system/researcher looks at a specific application or system • No standards concerning functionality • No standard query language • Unreasonable to expect that end users will access raw data, data in the DSMS, or even data at a summarized view • Domain experts need to “see” a higher level of the data

  18. Data Stream Modeling • Single pass: Each record is examined at most once • Bounded storage: Limited memory for storing the synopsis • Real-time: Per record processing time must be low • Summarization (synopsis) of data • Use the data, NOT a SAMPLE • Temporal and spatial • Dynamic • Continuous (infinite stream) • Learn • Forget • Sublinear growth rate - Clustering

  19. Problem with Markov Chains • The required structure of the MC may not be certain at model construction time. • As the real world being modeled by the MC changes, so should the structure of the MC. • Not scalable – grows linearly with the number of events. • Markov Property • Our solution: • Extensible Markov Model (EMM) • Cluster real world events • Allow the Markov chain to grow and shrink dynamically

  20. EMM Sublinear Growth Rate [Figure: EMM growth rate on Minnesota Department of Transportation (MnDot) traffic data.]

  21. Traditional Clustering

  22. TRAC-DS

  23. Motivation • Temporal Ordering is a major feature of stream data. • Many stream applications depend on this ordering • Prediction of future values • Anomaly (rare event) detection • Concept drift

  24. Stream Clustering Requirements • Dynamic updating of the clusters • Identify outliers • Barbará: • Compactness • Fast, incremental processing

  25. Stream Clustering Algorithms • LOCALSEARCH • Partitions stream into segments • Clusters each segment individually by solving the k-medians problem • Iteratively reclusters the resulting centers • CluStream • Micro-clusters represented by summary statistics. • Micro-clusters are handled online • Micro-clusters merged offline • MONIC • Evolution of clusters over time • Cluster transitions over time

  26. TRAC-DS NOTE • TRAC-DS is not: • Another stream clustering algorithm • TRAC-DS is: • A new way of looking at clustering • Built on top of an existing clustering algorithm • TRAC-DS may be used with any stream clustering algorithm

  27. TRAC-DS Overview

  28. Data Stream Clustering • At each point in time a data stream clustering ζ is a partitioning of D', the data seen thus far. • Instead of the whole partitions C1, C2, ..., Ck, only the synopses Cc1, Cc2, ..., Cck are available, and k is allowed to change over time. • The summaries Cci, with i = 1, 2, ..., k, typically contain information about the size, distribution and location of the data points in Ci.

  29. TRAC-DS Definition Given a data stream clustering ζ, a temporal relationship among clusters (TRAC-DS) overlays the data stream clustering ζ with an EMM M, in such a way that the following are satisfied: (1) There is a one-to-one correspondence between the clusters in ζ and the states S in M. (2) A transition aij in the EMM M represents the probability that, given a data point in cluster i, the next data point in the data stream will belong to cluster j, with i, j = 1, 2, ..., k. (3) The EMM M is created online together with the data stream clustering.

  30. Clustering Operations A clustering operation is a function q : ζ × x → ζ which is used by the data stream clustering algorithm to update the clustering ζ given some additional information x, which is either a new data point or other information (e.g., the number of the cluster to be deleted in order to simplify the clustering).

  31. TRAC-DS Operations • A TRAC-DS operation is a function r : M × sc × y → M × sc that updates the temporal relationship among clusters represented by the EMM M with states S given a current state sc ∈ S and additional information y and returns an updated EMM and possibly a new current state. • In order to be able to dynamically update the EMM M we need to store a transition count matrix C. The count cij in C contains the number of times we observed a new point being assigned by the clustering algorithm to cluster i followed by a point being assigned to cluster j.
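
A hedged sketch of how the count matrix C described above might be maintained when a new point arrives: the clustering operation (the q side) and the corresponding TRAC-DS operation (the r side) are applied together. The `clusterer.assign` interface and all names are illustrative assumptions, not an API from the paper:

```python
from collections import defaultdict

def assign_point(clusterer, counts, current_state, x):
    """Pair the clustering update (q) with the EMM update (r) for one point x.

    clusterer     : any stream clustering algorithm exposing assign(x) -> cluster id
                    (a hypothetical interface)
    counts        : defaultdict(int) mapping (i, j) -> observed i -> j transition count
    current_state : state for the cluster of the previous point, or None at the start
    """
    j = clusterer.assign(x)                 # q_assign point: clustering side
    if current_state is not None:
        counts[(current_state, j)] += 1     # r_assign point: EMM side, update C
    return j                                # becomes the new current state sc
```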

  32. Stream Clustering Operations * • qassign point(ζ,x): Assigns the new data point x to an existing cluster. • qnew cluster(ζ,x): Create a new cluster. • qremove cluster(ζ,x): Removes a cluster. Here x is the cluster, i, to be removed. In this case the associated summary Cci is removed from ζ and k is decremented by one. • qmerge clusters(ζ,x): Merges two clusters. • qfade clusters(ζ,x): Fades the cluster structure. • qsplit clusters(ζ,x): Splits a cluster. * Inspired by MONIC

  33. TRAC-DS Operations • rassign point(M,sc,y): Assigns the new data point to the state representing an existing cluster. • rnew cluster(M,sc,y): Creates a state for a new cluster. • rremove cluster(M,sc,y): Removes a state. • rmerge clusters(M,sc,y): Merges two states. • rfade clusters(M,sc,y): Fades the transition probabilities using an exponential decay f(t) = 2^(−λt). • rsplit clusters(M,sc,y): Splits states. • The additional information y is supplied by the corresponding clustering operations.
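
A small sketch of the fading idea behind rfade clusters: every stored transition count is scaled by the decay factor 2^(−λt) before newer observations are added. The data structure and the choice of λ and t are assumptions for illustration:

```python
def fade_counts(transition_counts, lam, t=1.0):
    """Apply the exponential decay f(t) = 2**(-lam * t) to all transition counts."""
    decay = 2 ** (-lam * t)
    for key in transition_counts:
        transition_counts[key] *= decay     # older transitions carry less weight
```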

  34. TRAC-DS Example

  35. TRAC-DS Advantages • Dynamic • Flexible: • Use any Clustering Algorithm • Supports any clustering operations • Scalable • Merges Clustering & Markov Modeling

  36. What is an Anomaly? • An event that is unusual • An event that doesn’t occur frequently • A predefined event • What is unusual? • What is a deviation?

  37. What is an Anomaly in Stream Data? • Rare - Anomalous - Surprising • Out of the ordinary • Not outlier detection • No knowledge of the data distribution • Data is not static • Must take temporal and spatial values into account • May be interested in a sequence of events • Ex: Snow in upstate New York is not an anomaly • Snow in upstate New York in June is rare • Rare events may change over time

  38. Statistical View of Anomaly • Outlier • A data item that is outside the normal distribution of the data • Identify by box plot [Image from Data Mining: Introductory and Advanced Topics, Prentice Hall, 2002.]
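
For reference, a minimal IQR-based outlier check of the kind a box plot visualizes; the conventional 1.5 × IQR fence is used here as an assumption, not something stated on the slide:

```python
import statistics

def iqr_outliers(values):
    """Return the values falling outside the 1.5 * IQR box-plot fences."""
    q1, _, q3 = statistics.quantiles(values, n=4)   # first and third quartiles
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lo or v > hi]
```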

  39. Statistical View of Anomaly • Identify by looking at the distribution • THIS DOES NOT WORK with stream data [Image from www.wikipedia.org, normal distribution.]

  40. Data Mining View of Anomaly • Classification Problem • Build classifier from training data • Problem is that training data shows what is NOT an anomaly • Thus an anomaly is anything that is not viewed as normal by the classification technique • MUST build dynamic classifier • Identify anomalous behavior • Signatures of what anomalous behavior looks like • Input data is identified as anomaly if it is similar enough to one of these signatures • Mixed – Classification and Signature

  41. EMM Advantages • Dynamic • Adaptable • Use of clustering • Learns rare event • Scalable: • Growth of EMM is not linear on size of data. • Hierarchical feature of EMM • Creation/evaluation quasi-real time • Distributed / Hierarchical extensions

  42. Growth of EMM [Figure: growth of the EMM on Serwent sensor data.]

  43. TRAC-DS Approach to Detect Anomalies • By learning what is normal, the model can predict what is not • Normal is based on likelihood of occurrence • Use TRAC-DS to build clusters and the behavior between clusters • We view a rare event as: • An unusual event • A transition between event states which does not occur frequently • Continue learning

  44. Determining Rare • Occurrence Frequency (OFi) of an EMM state Si is the normalized count of that state: • Normalized Transition Probability (NTPmn), from one state, Sm, to another, Sn, is a normalized transition count:
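
The two formulas appear only as images in the original slides; the sketch below is a plausible reconstruction consistent with the bullet text (state counts normalized by the total event count, transition counts normalized by the outgoing count of the source state), offered as an assumption rather than the authors' exact definitions:

```python
def occurrence_frequency(state_counts, state):
    """OF_i: count of state S_i divided by the total count over all states."""
    return state_counts.get(state, 0) / sum(state_counts.values())

def normalized_transition_probability(transition_counts, m, n):
    """NTP_mn: count of S_m -> S_n transitions divided by all transitions out of S_m."""
    out_of_m = sum(c for (i, _), c in transition_counts.items() if i == m)
    return transition_counts.get((m, n), 0) / out_of_m if out_of_m else 0.0
```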

  45. EMMRare • The EMMRare algorithm indicates whether the current input event is rare. Using a threshold occurrence percentage, the input event is determined to be rare if either of the following occurs: • The frequency of the node at time t+1 is below this threshold • The updated transition probability of the MC transition from the node at time t to the node at time t+1 is below the threshold
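
A hedged sketch of that test, re-implementing the OF and NTP computations inline so it stands alone; the threshold value and data structures are illustrative assumptions:

```python
def emm_rare(state_counts, transition_counts, prev_state, new_state, threshold):
    """Return True if the current input event should be flagged as rare."""
    total = sum(state_counts.values())
    of = state_counts.get(new_state, 0) / total           # frequency of the node at t+1
    out_of_prev = sum(c for (i, _), c in transition_counts.items() if i == prev_state)
    ntp = (transition_counts.get((prev_state, new_state), 0) / out_of_prev
           if out_of_prev else 0.0)                        # transition probability t -> t+1
    return of < threshold or ntp < threshold
```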
