I don’t need a title slide for a lecture. Long, long ago, in a galaxy far, far away…
Outline • Background • Data mining • Association Rules • Classification • Clustering • Sequential Patterns • Sequence Similarity
Knowledge Discovery in Databases (KDD) • What is it? • Finding useful patterns in data • Why do we need it? • Terabytes of data • Impractical to manually search for patterns • Where does data mining come in?
Steps of a KDD process • Learn the application domain • Create a target dataset • Clean and preprocess data • Choose type of data mining • Pick an algorithm • Perform data mining • Interpret results
Databases vs. Data warehouses • Databases provide for: • Queries over current data • Persistent storage • Atomic updates • Data warehouses provide for: • Storage of all data • Details or summaries • Metadata • Data cleaning, integration • Fast access to data
Who’s interested? • Databases - large amounts of data • Artificial Intelligence - search, planning, machine learning • Information Retrieval - searching for similar documents • Image Processing - finding similar images
Types of data mining • Association Rules • Classification • Clustering • Sequential Patterns • Sequence Similarity
Association rules • What are they? • Rules capturing which items commonly occur together in basket data (co-occurrence, not necessarily causation) • Where are they used? • Store layout • Catalog design • Customer segmentation
Association rules example Find all itemsets that occur at least twice, and the rule implied by each
Association rules metrics For a rule a → b • support = a and b occur together in at least s% of the n baskets • confidence = of all the baskets containing a, at least c% also contain b
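With baskets represented as sets, the two metrics can be sketched directly; the basket data below is hypothetical, not from the lecture:

```python
# Hypothetical basket data: each basket is a set of items.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    hits = sum(1 for b in baskets if itemset <= b)
    return hits / len(baskets)

def confidence(a, b, baskets):
    """Of the baskets containing `a`, the fraction that also contain `b`."""
    return support(a | b, baskets) / support(a, baskets)

print(support({"bread", "milk"}, baskets))       # 2 of 4 baskets -> 0.5
print(confidence({"bread"}, {"milk"}, baskets))  # 2 of the 3 bread baskets
```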
Association rules algorithms • Focus on finding support for “itemsets” • The naïve method: • Combine itemsets of size k−1 that differ only in the last item to form Candidates_k • Measure the support of the itemsets from step 1 to form the large itemsets Large_k • Increase k and repeat until there are no new large itemsets
Example: itemsets of size 1, looking for a support of 2
Apriori algorithm • An itemset cannot be a large itemset unless all of its subsets are large itemsets • Reduces number of candidate itemsets considered
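The candidate-generation step, with the Apriori pruning property folded in, can be sketched in Python; the large 2-itemsets below are a toy example, not from the slides:

```python
from itertools import combinations

def apriori_gen(large_prev):
    """large_prev: set of frozensets, all of size k-1.
    Returns candidate k-itemsets, pruned by the Apriori property:
    every (k-1)-subset of a candidate must itself be a large itemset."""
    k_minus_1 = len(next(iter(large_prev)))
    candidates = set()
    for a in large_prev:
        for b in large_prev:
            union = a | b
            if len(union) == k_minus_1 + 1:
                # Keep the candidate only if all its (k-1)-subsets are large
                if all(frozenset(s) in large_prev
                       for s in combinations(union, k_minus_1)):
                    candidates.add(union)
    return candidates

large_2 = {frozenset(p) for p in [("a", "b"), ("a", "c"), ("b", "c"), ("b", "d")]}
# {b, c, d} is pruned because its subset {c, d} is not large
print(apriori_gen(large_2))  # only {a, b, c} survives
```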
Research directions • Online construction of rules • CARMA (Berkeley) • Pre-filtering the data • a posteriori (Limburgs Universitair Centrum)
Classification • What is it? • Rules that partition data into separate groups • Where is it used? • Classifying people as good/bad credit risks • Weather prediction • Fraud detection • Variation: best k of n (e.g., who to send flyers to)
Possible solutions • Bayesian classification • Neural networks • Genetic algorithms • Decision Trees
Decision trees
Salary < 25,000?
├─ no → Accept
└─ yes → Graduate education?
    ├─ yes → Accept
    └─ no → Reject
Decision trees • Build the tree in two steps • Build a perfect tree on sample data • At each node, pick a “good” attribute • Split data according to attribute • Recursively build tree on children • Prune the tree • Minimum Description Length • Cost of encoding tree structure • Cost of encoding split attribute • Cost of encoding leaf data records
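The build step can be sketched as a recursive split on toy boolean records. Gini impurity stands in for the “good” attribute measure (an assumed choice; the lecture does not name one), and pruning is omitted:

```python
def gini(labels):
    """Gini impurity of a list of binary class labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split(records, labels, attrs):
    """Pick the attribute whose yes/no split minimizes weighted impurity."""
    def score(attr):
        yes = [l for r, l in zip(records, labels) if r[attr]]
        no = [l for r, l in zip(records, labels) if not r[attr]]
        return (len(yes) * gini(yes) + len(no) * gini(no)) / len(labels)
    return min(attrs, key=score)

def build(records, labels, attrs):
    """Recursively build a perfect tree on the sample data (no pruning).
    Returns either a leaf label or a tuple (attr, no_subtree, yes_subtree)."""
    if len(set(labels)) == 1 or not attrs:
        return max(set(labels), key=labels.count)  # leaf: majority class
    attr = best_split(records, labels, attrs)
    rest = [a for a in attrs if a != attr]
    yes = [(r, l) for r, l in zip(records, labels) if r[attr]]
    no = [(r, l) for r, l in zip(records, labels) if not r[attr]]
    return (attr,
            build([r for r, _ in no], [l for _, l in no], rest),
            build([r for r, _ in yes], [l for _, l in yes], rest))

# Hypothetical applicants: boolean attributes, label 1 = accept
records = [{"low_salary": False, "graduate": False},
           {"low_salary": False, "graduate": True},
           {"low_salary": True, "graduate": True},
           {"low_salary": True, "graduate": False}]
labels = [1, 1, 1, 0]
print(build(records, labels, ["low_salary", "graduate"]))
```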
Research directions • Integrate building and pruning • PUBLIC (Bell Labs) • Incremental Updates • BOAT (University of Wisconsin)
Clustering • What is it? • Given n points, separate them into k clusters • Where is it used? • Information retrieval - text classification • Identify similar web documents • Mapping the universe
Traditional clustering algorithms • Partitional • Determine k partitions that optimize a function • Common function is the “square error function” • Hierarchical • Each point starts as a cluster • Clusters are merged until k clusters remain
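The hierarchical approach can be sketched on one-dimensional points; single-link distance is an assumed merge criterion, since the slide does not specify one:

```python
from itertools import combinations

def single_link(c1, c2):
    """Distance between clusters = distance of the closest pair across them."""
    return min(abs(p - q) for p in c1 for q in c2)

def agglomerate(points, k):
    """Each point starts as its own cluster; merge the two closest
    clusters until only k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: single_link(*pair))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(sorted(a + b))
    return sorted(clusters)

print(agglomerate([1.0, 1.5, 5.0, 5.2, 9.0], k=3))
# -> [[1.0, 1.5], [5.0, 5.2], [9.0]]
```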
Research directions • Higher dimension subspace clustering • CLIQUE (IBM Almaden) • Incremental clustering • Incremental DBScan (University of Munich) • Remove problems with outliers • CURE (Bell Labs)
Sequential patterns • What is it? • Given a set of events, find frequently occurring patterns • Where is it used? • Analyzing basket data • Medical diagnosis
AprioriAll • Create all large events that occur once • Map each subset to numbers • While there still are large itemsets: • Find candidate itemsets of length k • Find large itemsets of length k • Increase k
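Counting support in AprioriAll rests on sequence containment: a customer supports a pattern if the pattern’s itemsets appear, in order, within the customer’s transactions. A minimal sketch of that containment test, on made-up data:

```python
def contains(customer_seq, pattern):
    """True if `pattern` (a list of itemsets) occurs in order
    within `customer_seq` (the customer's ordered transactions)."""
    i = 0
    for transaction in customer_seq:
        if i < len(pattern) and pattern[i] <= transaction:
            i += 1
    return i == len(pattern)

customer = [{"tv"}, {"dvd", "remote"}, {"cable"}]
print(contains(customer, [{"tv"}, {"cable"}]))  # True
print(contains(customer, [{"cable"}, {"tv"}]))  # False: wrong order
```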
Research directions • Time limitations • WINEPI (Helsinki/Microsoft) • Itemsets over multiple transactions • CSP (IBM Almaden)
Sequence Similarity • What is it? • Given a number of data sets, look for similar trends • Where is it used? • Find stocks with similar price movements • Find geological irregularities
Example • Are the two sequences similar?
Basic algorithm • Scale data • Match all gap-free sequences • Form pairs of large similar sequences • Find the longest common subsequence
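The final step, finding the longest common subsequence, can be sketched with the standard dynamic-programming recurrence (a generic LCS over discrete values, not the full windowed time-series variant):

```python
def lcs(xs, ys):
    """Length of the longest common subsequence of xs and ys."""
    n, m = len(xs), len(ys)
    # dp[i][j] = length of the LCS of xs[:i] and ys[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if xs[i - 1] == ys[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[n][m]

print(lcs([1, 3, 5, 7, 9], [1, 5, 9, 11]))  # -> 3  (subsequence 1, 5, 9)
```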
Research directions • Finding surprising patterns • IBM Almaden
Data mining directions • Sampling • Fractals • Pre-partitioning data • Making data mining more accessible • User defined aggregation support
References • General data mining: http://www.almaden.ibm.com/cs/quest, http://www.bell-labs.com/project/serendip • Association rules: “Fast Algorithms for Mining Association Rules”, Agrawal and Srikant; VLDB ’94.  • Classification: “PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning”, Rastogi and Shim; VLDB ’98.
References (cont.) • Clustering: “CURE: An Efficient Clustering Algorithm for Large Databases”, Guha, Rastogi, and Shim; SIGMOD ’98. • Sequential patterns: “Mining Sequential Patterns: Generalizations and Performance Improvements”, Srikant and Agrawal; EDBT ’96. • Similarity search: “Fast Similarity Search in the Presence of Noise, Scaling, and Translation in Time-Series Databases”, Agrawal, Lin, Sawhney, and Shim; VLDB ’95.