
I don’t need a title slide for a lecture


manuele


Presentation Transcript


  1. I don’t need a title slide for a lecture Long long ago, in a galaxy far, far away…

  2. Outline • Background • Data mining • Association Rules • Classification • Clustering • Sequential Patterns • Sequence Similarity

  3. Knowledge Discovery in Databases (KDD) • What is it? • Finding useful patterns in data • Why do we need it? • Terabytes of data • Impractical to manually search for patterns • Where does data mining come in?

  4. Steps of a KDD process • Learn the application domain • Create a target dataset • Clean and preprocess data • Choose type of data mining • Pick an algorithm • Perform data mining • Interpret results

  5. Databases vs. Data warehousing • Data warehousing • Storage of all data • Details or summaries • Metadata • Data cleaning, integration • Databases • Queries over current data • Persistent storage • Atomic updates

  6. Databases vs. Data warehouses • Databases provide for: • Queries over current data • Persistent storage • Atomic updates • Data warehouses provide for: • Storage of all data • Metadata • Data cleaning, integration • Fast access to data

  7. Who’s interested? • Databases - large amounts of data • Artificial Intelligence - search, planning, machine learning • Information Retrieval - searching for similar documents • Image Processing - finding similar images

  8. Types of data mining • Association Rules • Classification • Clustering • Sequential Patterns • Sequence Similarity

  9. Association rules • What are they? • Rules describing items that frequently occur together in basket data (co-occurrence, which does not by itself establish causation) • Where are they used? • Store layout • Catalog design • Customer segmentation

  10. Association rules example • Find all itemsets that occur at least twice, and the rules that each itemset implies

  11. Association rules metrics • For a rule a → b: • support = a and b occur together in at least s% of the n baskets • confidence = of all of the baskets containing a, at least c% also contain b
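
The two metrics on slide 11 can be computed directly from basket data. The following is a minimal sketch, not taken from the lecture: the function names and the sample baskets are illustrative assumptions.

```python
def support(baskets, itemset):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for b in baskets if itemset <= set(b)) / len(baskets)


def confidence(baskets, a, b):
    """Of the baskets containing `a`, the fraction that also contain `b`."""
    return support(baskets, set(a) | set(b)) / support(baskets, a)


# Hypothetical basket data, for illustration only.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

rule_support = support(baskets, {"bread", "milk"})          # 2 of 4 baskets
rule_confidence = confidence(baskets, {"bread"}, {"milk"})  # 2 of the 3 bread baskets
```

A rule a → b is reported only when both values meet the user-chosen thresholds s and c.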

  12. Association rules algorithms • Focus on finding support for “itemsets” • The naïve method: • Step 1: combine itemsets of size k-1 that differ only on the last item to find Candidates_k • Step 2: measure support of the candidates from step 1 to form the large itemsets of size k • Step 3: increase k and repeat until no new large itemsets are found

  13. Itemsets of size 1 Looking for support of 2

  14. Finding candidate set 2

  15. Finding candidate set 3

  16. Apriori algorithm • An itemset cannot be a large itemset unless all of its subsets are large itemsets • Reduces number of candidate itemsets considered
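
The level-wise search from slide 12, together with the subset pruning from slide 16, can be sketched as below. This is an illustrative sketch under stated assumptions, not the implementation from the cited paper: `apriori`, `min_count`, and the sample baskets are hypothetical, and items are assumed to be sortable (plain strings here).

```python
from itertools import combinations


def apriori(baskets, min_count):
    """Return {itemset: count} for every itemset found in at least `min_count` baskets."""
    baskets = [frozenset(b) for b in baskets]

    # Level 1: count single items.
    counts = {}
    for basket in baskets:
        for item in basket:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    large = {s: c for s, c in counts.items() if c >= min_count}
    result = dict(large)

    k = 2
    while large:
        prev = sorted(tuple(sorted(s)) for s in large)
        candidates = set()
        # Join: combine (k-1)-itemsets that differ only on the last item.
        for x, y in combinations(prev, 2):
            if x[:-1] == y[:-1]:
                cand = frozenset(x) | frozenset(y)
                # Prune: every (k-1)-subset of a candidate must itself be large.
                if all(frozenset(sub) in large for sub in combinations(cand, k - 1)):
                    candidates.add(cand)
        # Measure support of the surviving candidates.
        counts = {c: sum(1 for b in baskets if c <= b) for c in candidates}
        large = {s: c for s, c in counts.items() if c >= min_count}
        result.update(large)
        k += 1
    return result


# Hypothetical baskets; an itemset is "large" if it occurs at least twice.
baskets = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "d"}]
large_itemsets = apriori(baskets, min_count=2)
```

With this data the candidate {a, b, c} is discarded by the prune step without being counted, because its subset {b, c} is not large.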

  17. Research directions • Online construction of rules • CARMA (Berkeley) • Pre filtering the data • a posteriori (Limburgs Universitair Centrum)

  18. Classification • What is it? • Rules that partition data into separate groups. • Where is it used? • to classify people as good/bad credit risks • weather prediction • fraud detection • Variation: best k of n (who to send flyers to)

  19. Classification example

  20. Possible solutions • Bayesian classification • Neural networks • Genetic algorithms • Decision Trees

  21. Decision trees • [Figure: decision tree with the root test “Salary < 25,000”, a second test “Graduate education?”, yes/no branches, and leaves Accept and Reject]

  22. Decision trees • Build the tree in two steps • Build a perfect tree on sample data • At each node, pick a “good” attribute • Split data according to attribute • Recursively build tree on children • Prune the tree • Minimum Description Length • Cost of encoding tree structure • Cost of encoding split attribute • Cost of encoding leaf data records
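
The building phase from slide 22 (pick a “good” attribute, split, recurse) can be sketched as follows. The slide does not say how a good attribute is chosen; this sketch assumes Gini impurity over boolean attributes, and the attribute names and training rows are hypothetical. The pruning phase (Minimum Description Length) is not shown.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means pure)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels, attributes):
    """Pick the boolean attribute whose split most reduces weighted Gini impurity."""
    best, best_score = None, gini(labels)
    for attr in attributes:
        yes = [l for r, l in zip(rows, labels) if r[attr]]
        no = [l for r, l in zip(rows, labels) if not r[attr]]
        if not yes or not no:
            continue  # this attribute does not separate the data
        score = (len(yes) * gini(yes) + len(no) * gini(no)) / len(labels)
        if score < best_score:
            best, best_score = attr, score
    return best


def build_tree(rows, labels, attributes):
    """Recursively build a tree; leaves are the majority class label."""
    attr = best_split(rows, labels, attributes)
    if attr is None:
        return Counter(labels).most_common(1)[0][0]
    rest = [a for a in attributes if a != attr]
    yes = [(r, l) for r, l in zip(rows, labels) if r[attr]]
    no = [(r, l) for r, l in zip(rows, labels) if not r[attr]]
    return {
        "test": attr,
        "yes": build_tree([r for r, _ in yes], [l for _, l in yes], rest),
        "no": build_tree([r for r, _ in no], [l for _, l in no], rest),
    }


def classify(tree, row):
    """Walk from the root to a leaf and return its label."""
    while isinstance(tree, dict):
        tree = tree["yes"] if row[tree["test"]] else tree["no"]
    return tree


# Hypothetical training data, for illustration only.
rows = [
    {"low_salary": False, "graduate": False},
    {"low_salary": False, "graduate": True},
    {"low_salary": True, "graduate": True},
    {"low_salary": True, "graduate": False},
]
labels = ["accept", "accept", "accept", "reject"]
tree = build_tree(rows, labels, ["low_salary", "graduate"])
```

Because the tree is grown until no attribute improves purity, it fits the sample exactly, which is why a separate pruning step follows in the two-step approach.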

  23. Research directions • Integrate building and pruning • PUBLIC (Bell Labs) • Incremental Updates • BOAT (University of Wisconsin)

  24. Clustering • What is it? • Given n points, separate them into k clusters • Where is it used? • Information retrieval - text classification • Identify similar web documents • Mapping the universe

  25. Clustering example

  26. Traditional clustering algorithms • Partitional • Determine k partitions that optimize a function • Common function is the “square error function” • Hierarchical • Each point starts as a cluster • Clusters are merged until k clusters remain
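
Both ideas on slide 26 fit in a few lines: the square error function scores a partition, and the hierarchical method merges clusters until k remain. This is a minimal sketch that assumes clusters are merged by the distance between their centroids (the slide does not name a merge criterion); the function names and sample points are hypothetical.

```python
def centroid(points):
    """Coordinate-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def square_error(clusters):
    """Sum of squared distances from each point to its cluster centroid."""
    return sum(sq_dist(p, centroid(c)) for c in clusters for p in c)


def agglomerative(points, k):
    """Start with one cluster per point; merge the two closest until k remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        pairs = (
            (i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        i, j = min(
            pairs,
            key=lambda ij: sq_dist(centroid(clusters[ij[0]]), centroid(clusters[ij[1]])),
        )
        merged = clusters.pop(j)  # j > i, so index i is still valid
        clusters[i].extend(merged)
    return clusters


# Hypothetical 2-D points forming two obvious groups.
points = [(0, 0), (0, 1), (10, 10), (10, 11)]
clusters = agglomerative(points, k=2)
error = square_error(clusters)
```

Each merge step here scans all pairs of clusters, so the sketch is only practical for small inputs.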

  27. Clustering difficulties

  28. Research directions • Higher dimension subspace clustering • CLIQUE (IBM Almaden) • Incremental clustering • Incremental DBScan (University of Munich) • Remove problems with outliers • CURE (Bell Labs)

  29. Sequential patterns • What is it? • Given a set of events, find frequently occurring patterns • Where is it used? • Analyzing basket data • Medical diagnosis

  30. Sequential patterns example

  31. AprioriAll • Create all large events that occur once • Map each subset to numbers • While there still are large itemsets: • Find candidate itemsets of length k • Find large itemsets of length k • Increase k
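
The “find large itemsets of length k” step needs a way to count how many customers support a candidate sequence. The sketch below shows only that counting step, assuming each customer is an ordered list of transactions and a pattern is an ordered list of itemsets; the names and data are hypothetical, and candidate generation and the itemset-to-number mapping are not shown.

```python
def contains(customer_seq, pattern):
    """True if the pattern's itemsets appear, in order, in distinct transactions."""
    pos = 0
    for itemset in pattern:
        # Advance to the next transaction that includes this itemset.
        while pos < len(customer_seq) and not set(itemset) <= set(customer_seq[pos]):
            pos += 1
        if pos == len(customer_seq):
            return False
        pos += 1  # later itemsets must match later transactions
    return True


def sequence_support(customers, pattern):
    """Fraction of customers whose transaction history contains the pattern."""
    return sum(contains(seq, pattern) for seq in customers) / len(customers)


# Hypothetical purchase histories, one ordered list of baskets per customer.
customers = [
    [{"tv"}, {"vcr"}, {"tapes"}],
    [{"tv", "antenna"}, {"tapes"}],
    [{"vcr"}, {"tv"}],
]

tv_then_tapes = sequence_support(customers, [{"tv"}, {"tapes"}])
tv_then_vcr = sequence_support(customers, [{"tv"}, {"vcr"}])
```

Unlike association rules, order matters: the third customer bought both a TV and a VCR but does not support the pattern "TV, then VCR".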

  32. Mapping the itemsets

  33. Research directions • Time limitations • WINEPI (Helsinki/Microsoft) • Itemsets over multiple transactions • CSP (IBM Almaden)

  34. Sequence Similarity • What is it? • Given a number of data sets, look for similar trends • Where is it used? • Find stocks with similar price movements • Find geological irregularities

  35. Example • Are the two sequences similar?

  36. Basic algorithm • Scale data • Match all gap-free sequences • Form pairs of large similar sequences • Find the longest common subsequence
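
A much-simplified version of this pipeline scales both sequences and then measures their longest common subsequence, treating two values as equal when they differ by at most a tolerance. This sketch skips the gap-free window matching and stitching steps; min-max scaling, the `eps` tolerance, and the sample series are assumptions made for illustration.

```python
def scale(seq):
    """Min-max scale a sequence to the range [0, 1]."""
    lo, hi = min(seq), max(seq)
    if hi == lo:
        return [0.0] * len(seq)
    return [(x - lo) / (hi - lo) for x in seq]


def lcs_length(a, b, eps):
    """Length of the longest common subsequence; values match if within eps."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if abs(x - y) <= eps:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]


# Hypothetical price series: same shape at different scales, plus a zigzag.
stock_a = [10, 20, 30, 20, 10]
stock_b = [100, 200, 300, 200, 100]
stock_c = [1, 5, 1, 5, 1]

same_shape = lcs_length(scale(stock_a), scale(stock_b), eps=0.05)
different_shape = lcs_length(scale(stock_a), scale(stock_c), eps=0.05)
```

Dividing the result by the sequence length gives a similarity score between 0 and 1 that can be compared against a threshold.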

  37. Research directions • Finding surprising patterns • IBM Almaden

  38. Data mining directions • Sampling • Fractals • Pre-partitioning data • Making data mining more accessible • User defined aggregation support

  39. References • General Data mining: http://www.almaden.ibm.com/cs/quest, www.bell-labs.com/project/serendip • Association Rules: “Fast Algorithms for Mining Association Rules”, Agrawal and Srikant; VLDB 94. • Classification: “PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning”, Rastogi and Shim; VLDB 98.

  40. References (cont.) • Clustering: “CURE: An Efficient Clustering Algorithm for Large Databases”, Guha, Rastogi, Shim; SIGMOD 98. • Sequential Patterns: “Mining Sequential Patterns: Generalizations and Performance Improvements”, Srikant and Agrawal; EDBT 96. • Similarity Search: “Fast Similarity Search in the Presence of Noise, Scaling, and Translation in Time-Series Databases”, Agrawal, Lin, Sawhney, and Shim; VLDB 95.
