
Data Mining



  1. Data Mining

  2. Data Mining (DM)/ Knowledge Discovery in Databases (KDD) “The nontrivial extraction of implicit, previously unknown, and potentially useful information from data” [Frawley et al., 1992]

  3. Need for Data Mining • Increased ability to generate data • Remote sensors and satellites • Bar codes for commercial products • Computerization of businesses

  4. Need for Data Mining • Increased ability to store data • Media: bigger magnetic disks, CD-ROMs • Better database management systems • Data warehousing technology

  5. Need for Data Mining • Examples • Wal-Mart records 20,000,000 transactions/day • Healthcare transactions yield multi-GB databases • Mobil Oil exploration storing 100 terabytes • Human Genome Project, multi-GBs and increasing • Astronomical object catalogs, terabytes of images • NASA EOS, 1 terabyte/day

  6. Something for Everyone • Bell Atlantic • MCI • Land’s End • Visa • Bank of New York • FedEx

  7. Market Analysis and Management • Customer profiling • Data mining can tell you what types of customers buy what products (clustering or classification) or what products are often bought together (association rules). • Identifying customer requirements • Discover relationship between personal characteristics and probability of purchase • Discover correlations between purchases

  8. Fraud Detection and Management • Applications: • Widely used in health care, retail, credit card services, telecommunications, etc. • Approach: • Use historical data to build models of fraudulent behavior and use data mining to help identify similar instances. • Examples: • Auto Insurance • Money Laundering • Medical Insurance

  9. IBM Advertisement

  10. DM step in KDD Process

  11. [Diagram: data mining at the intersection of statistics, databases, AI, and hardware]

  12. Mining Association Rules

  13. Mining Association Rules • Association rule mining: • Finding associations or correlations among a set of items or objects in transaction databases, relational databases, and data warehouses. • Applications: • Basket data analysis, cross-marketing, catalog design, loss-leader analysis, clustering, etc. • Examples: • Rule form: “Body → Head [support, confidence]” • Buys=Diapers → Buys=Beer [0.5%, 60%] • Major=CS ∧ Class=DataMining → Grade=A [1%, 75%]

  14. Rule Measures: Support and Confidence • [Venn diagram: customers who buy diapers, customers who buy beer, and customers who buy both] • Find all the rules X ∧ Y → Z with minimum confidence and support • support, s: probability that a transaction contains {X, Y, Z} • confidence, c: conditional probability that a transaction having {X, Y} also contains Z • For minimum support 50%, minimum confidence 50%: • A → C (50%, 66.6%) • C → A (50%, 100%)

  15. Association Rule • Given • A set of items I = {i1, i2, ..., im} • A set of transactions D; each transaction T in D is a set of items • An association rule is an implication X ⇒ Y, where X and Y are itemsets with X ∩ Y = ∅ • The rule meets a minimum confidence c: c = |{T in D : X ∪ Y ⊆ T}| / |{T in D : X ⊆ T}| (c% of the transactions in D that contain X also contain Y) • The rule also meets a minimum support s: s = |{T in D : X ∪ Y ⊆ T}| / |D|

  16. Mining Strong Association Rules in Transaction DBs • Measurement of rule strength in a transaction DB: A → B [support, confidence] • support = Prob(A ∪ B) = (# of transactions containing all the items in A ∪ B) / (total # of transactions) • confidence = Prob(B | A) = (# of transactions that contain both A and B) / (# of transactions containing A) • We are often interested in only strong associations, i.e. support ≥ min_sup and confidence ≥ min_conf • Examples: • milk → bread [5%, 60%] • tire ∧ auto_accessories → auto_services [2%, 80%]
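
To make these two measures concrete, here is a small Python sketch; the toy transaction list and the helper name rule_strength are illustrative, not part of the slides.

```python
def rule_strength(transactions, lhs, rhs):
    """Compute (support, confidence) of the rule lhs -> rhs.

    transactions: list of sets of items
    lhs, rhs: sets of items (rule body and head)
    """
    n = len(transactions)
    both = sum(1 for t in transactions if lhs | rhs <= t)  # transactions containing all of A ∪ B
    body = sum(1 for t in transactions if lhs <= t)        # transactions containing A
    support = both / n
    confidence = both / body if body else 0.0
    return support, confidence

# Toy database: beer and diapers appear together in 3 of 5 transactions.
db = [{"beer", "diapers"}, {"beer", "diapers", "milk"},
      {"diapers", "milk"}, {"beer", "chips"}, {"beer", "diapers"}]
print(rule_strength(db, {"diapers"}, {"beer"}))  # (0.6, 0.75)
```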

  17. Methods for Mining Associations • Apriori • Partition technique • Sampling technique • Anti-Skew • Multi-level or generalized association • Constraint-based or query-based association

  18. Apriori (Levelwise) • Scan the database multiple times • For the i-th scan, find all large itemsets of size i with minimum support • Use the large itemsets from scan i as input to scan i+1 • Create candidates: subsets of size i+1 all of whose size-i subsets are large itemsets • Notation: Lk = set of large k-itemsets; Ck = set of candidate large itemsets of size k • Note: If {A,B} is not a large itemset, then no superset of it can be either.
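
As a rough illustration of this levelwise scheme, the sketch below assumes transactions are given as Python sets and uses an absolute minimum support count; the function name apriori and the frozenset representation are my own choices, not from the slides.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Levelwise frequent-itemset mining (sketch).

    transactions: list of sets of items
    min_support: minimum support as an absolute transaction count
    Returns a dict mapping frozenset(itemset) -> support count.
    """
    # Scan 1: count single items and keep the large 1-itemsets (L1).
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    all_frequent = dict(frequent)

    k = 2
    while frequent:
        # Candidate generation: join large (k-1)-itemsets, then prune any
        # candidate with an infrequent (k-1)-subset (Apriori principle).
        prev = list(frequent)
        candidates = set()
        for i in range(len(prev)):
            for j in range(i + 1, len(prev)):
                union = prev[i] | prev[j]
                if len(union) == k and all(
                    frozenset(sub) in frequent
                    for sub in combinations(union, k - 1)
                ):
                    candidates.add(union)

        # Scan k: count candidate support in one pass over the database.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        frequent = {c: n for c, n in counts.items() if n >= min_support}
        all_frequent.update(frequent)
        k += 1

    return all_frequent
```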

  19. Mining Association Rules -- Example • Min. support 50%, min. confidence 50% • For rule A → C: • support = support({A, C}) = 50% • confidence = support({A, C}) / support({A}) = 66.6% • Apriori principle: Any subset of a frequent itemset must be frequent.

  20. Minsup = 0.25, Minconf = 0.5
  L1 = {(A, 3), (B, 2), (C, 2), (D, 1), (E, 1), (F, 1)}
  C2 = {(A,B), (A,C), (A,D), (A,E), (A,F), (B,C), ..., (E,F)}
  L2 = {(A,B, 1), (A,C, 2), (A,D, 1), (B,C, 1), (B,E, 1), (B,F, 1), (E,F, 1)}
  C3 = {(A,B,C), (A,B,D), (A,C,D), (B,C,E), (B,C,F), (B,E,F)}
  L3 = {(A,B,C, 1), (B,E,F, 1)}
  C4 = {}, L4 = {}, end of program
  Possible rules:
  A=>B (c=.33, s=1), B=>A (c=.5, s=1), A=>C (c=.67, s=2), C=>A (c=1.0, s=2),
  A=>D (c=.33, s=1), D=>A (c=1.0, s=1), B=>C (c=.5, s=1), C=>B (c=.5, s=1),
  B=>E (c=.5, s=1), E=>B (c=1, s=1), B=>F (c=.5, s=1), F=>B (c=1, s=1),
  A=>B&C (c=.33, s=1), B=>A&C (c=.5, s=1), C=>A&B (c=.5, s=1), A&B=>C (c=1, s=1),
  A&C=>B (c=.5, s=1), B&C=>A (c=1, s=1), B=>E&F (c=.5, s=1), E=>B&F (c=1, s=1),
  F=>B&E (c=1, s=1), B&E=>F (c=1, s=1), B&F=>E (c=1, s=1), E&F=>B (c=1, s=1)
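
Rule lists like the one above can be enumerated mechanically from the frequent itemsets and their counts. A sketch, assuming an itemset-to-count mapping such as the one returned by the earlier hypothetical apriori function:

```python
from itertools import combinations

def generate_rules(frequent, min_conf):
    """Enumerate rules X -> Y from frequent itemsets with confidence >= min_conf.

    frequent: dict mapping frozenset(itemset) -> support count
    Returns a list of (lhs, rhs, confidence, support_count) tuples.
    """
    rules = []
    for itemset, count in frequent.items():
        if len(itemset) < 2:
            continue
        # Split the itemset into every possible (body, head) pair.
        for r in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(itemset, r)):
                rhs = itemset - lhs
                conf = count / frequent[lhs]  # support(X ∪ Y) / support(X)
                if conf >= min_conf:
                    rules.append((set(lhs), set(rhs), conf, count))
    return rules
```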

  21. Example

  22. Partitioning • Requires only two passes through the external database • Divide the database into n partitions, each of which fits in main memory • Scan 1: Process one partition in memory at a time, finding local large itemsets • Candidate large itemsets are the union of all local large itemsets (a superset of the actual large itemsets; may contain false positives) • Scan 2: Calculate support and determine the actual large itemsets • If the data is skewed, partitioning may not work well: the chance that a local large itemset is a global large itemset may be small.
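
A minimal sketch of the two-scan partition idea, reusing the hypothetical apriori function from the earlier sketch; the equal-sized partition boundaries and the fractional min_support_frac parameter are my own framing.

```python
def partition_mining(transactions, min_support_frac, n_partitions):
    """Two-scan partition approach (sketch). min_support_frac is a fraction of |D|."""
    size = len(transactions)
    step = -(-size // n_partitions)  # ceiling division: partition size

    # Scan 1: find locally large itemsets in each in-memory partition.
    candidates = set()
    for start in range(0, size, step):
        part = transactions[start:start + step]
        local_min = max(1, int(min_support_frac * len(part)))
        candidates |= set(apriori(part, local_min))  # union of local large itemsets

    # Scan 2: count global support of every candidate; keep the truly large ones.
    global_min = min_support_frac * size
    counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
    return {c: n for c, n in counts.items() if n >= global_min}
```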

  23. Partitioning • Will any large itemsets be missed? • No: if l ∉ Li for every partition i (writing ti for the number of transactions in partition i, ti(l) for the number of those containing l, and MS for the minimum support), then t1(l)/t1 < MS and t2(l)/t2 < MS and … and tn(l)/tn < MS, thus t1(l) + t2(l) + … + tn(l) < MS × (t1 + t2 + … + tn), so l is not large in the whole database either.

  24. How do run times compare?

  25. PlayTennis Training Examples

  26. Association Rule Visualization: DBMiner

  27. DBMiner

  28. Association Rule Graph

  29. Clementine (UK, bought by SPSS) • The Web node shows the strength of associations in the data, i.e. how often field values coincide

  30. Multi-Level Association • [Concept hierarchy: food → bread, milk; milk → 2%, skim; bread → white, wheat; brands such as Fraser and Sunset at the lowest level] • A descendant of an infrequent itemset cannot be frequent • A transaction database can be encoded by dimensions and levels

  31. Encoding Hierarchical Information in the Transaction Database • A taxonomy for the relevant data items • Conversion of bar_code into generalized_item_id • [Taxonomy: food → milk, …, bread; milk → 2%, chocolate, …; brands such as Old Mills, Wonder, Dairyland, and Foremost at the leaves]
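
One simple way to realize this conversion is to replace each bar-code-level item with its ancestor at a chosen level of the taxonomy; the taxonomy dictionary and item names below are made up purely for illustration.

```python
# Hypothetical taxonomy: each item maps to its parent concept (roots are absent).
taxonomy = {
    "Dairyland 2% milk": "2% milk", "Foremost 2% milk": "2% milk",
    "2% milk": "milk", "chocolate milk": "milk",
    "Wonder white bread": "white bread", "white bread": "bread",
    "milk": "food", "bread": "food",
}

def generalize(item, levels_up=1):
    """Walk up the taxonomy a fixed number of levels (stopping at the root)."""
    for _ in range(levels_up):
        parent = taxonomy.get(item)
        if parent is None:
            break
        item = parent
    return item

# Encode a transaction of bar-code-level items one level up the hierarchy.
transaction = {"Dairyland 2% milk", "Wonder white bread"}
encoded = {generalize(i) for i in transaction}
print(encoded)  # {'2% milk', 'white bread'}
```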

  32. Mining Surprising Temporal Patterns (Chakrabarti et al.) • [1990: “Milk and cereal sell together!”] • Find prevalent rules that hold over large fractions of data • Useful for promotions and store arrangement • Intensively researched

  33. Prevalent != Interesting • [1995: “Milk and cereal sell together!” … 1998: “Milk and cereal sell together!” — Zzzz...] • Analysts already know about prevalent rules • Interesting rules are those that deviate from prior expectation • Mining’s payoff is in finding surprising phenomena

  34. Association Rules - Strengths & Weaknesses • Strengths • Understandable and easy to use • Useful • Weaknesses • Brute-force methods can be expensive (memory and time) • Apriori is O(CD), where C = sum of the sizes of the candidates (2^n possible, n = # of items) and D = size of the database • Association does not necessarily imply correlation • Validation? • Maintenance?

  35. Clustering

  36. Clustering • Group similar items together • Example: sorting laundry • Similar items may have important attributes / functionality in common • Group customers together with similar interests and spending patterns • Form of unsupervised learning • Cluster objects into classes using rule: • Maximize intraclass similarity, minimize interclass similarity

  37. Clustering Techniques • Partition • Enumerate all partitions • Score by some criteria • K means • Hierarchical • Model based • Hypothesize model for each cluster • Find model that best fits data • AutoClass, Cobweb

  38. Clustering Goal • Suppose you transmit coordinates of points drawn randomly from this dataset • Only allowed 2 bits/point • What encoder/decoder will lose least information?

  39. Idea One • Break into a grid • Decode each bit-pair as the middle of its grid cell • [Figure: 2×2 grid with cells labeled 00, 01, 10, 11]

  40. Idea Two • Break into a grid • Decode each bit-pair as the centroid of all data in that grid cell • [Figure: 2×2 grid with cells labeled 00, 01, 10, 11]
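
A sketch contrasting the two decoding ideas for 2-D points under a 2-bit (quadrant) code; the quadrant-based encoder, NumPy usage, and synthetic data are assumptions for illustration. Idea Two reconstructs each point as the centroid of the training data in its cell rather than the cell's geometric middle, which can only lower the average squared error on the training data.

```python
import numpy as np

def encode(points):
    """2-bit code per point: quadrant index from the signs of x and y."""
    return (points[:, 0] >= 0).astype(int) * 2 + (points[:, 1] >= 0).astype(int)

def decode_grid_center(codes, extent):
    """Idea One: decode each code as the middle of its grid cell,
    assuming the data roughly spans [-extent, extent] on each axis."""
    centers = np.array([[-extent / 2, -extent / 2], [-extent / 2, extent / 2],
                        [extent / 2, -extent / 2], [extent / 2, extent / 2]])
    return centers[codes]

def decode_centroid(codes, train_points, train_codes):
    """Idea Two: decode each code as the centroid of the training points in that cell."""
    centroids = np.array([train_points[train_codes == c].mean(axis=0) for c in range(4)])
    return centroids[codes]

rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 2))
codes = encode(data)
err_grid = np.mean(np.sum((data - decode_grid_center(codes, extent=2.0)) ** 2, axis=1))
err_cent = np.mean(np.sum((data - decode_centroid(codes, data, codes)) ** 2, axis=1))
# err_cent <= err_grid: the per-cell mean minimizes squared reconstruction error.
```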

  41. K Means Clustering • Ask user how many clusters (e.g., k=5)

  42. K Means Clustering • Ask user how many clusters (e.g., k=5) • Randomly guess k cluster center locations

  43. K Means Clustering • Ask user how many clusters (e.g., k=5) • Randomly guess k cluster center locations • Each data point finds closest center

  44. K Means Clustering • Ask user how many clusters (e.g., k=5) • Randomly guess k cluster center locations • Each data point finds closest center • Each cluster finds new centroid of its points

  45. K Means Clustering • Ask user how many clusters (e.g., k=5) • Randomly guess k cluster center locations • Each data point finds closest center • Each cluster finds new centroid of its points • Repeat until…
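
Putting the last few slides together, a bare-bones K-means loop might look like the following; the random initialization, NumPy representation, and stop-when-centers-stop-moving test are common choices rather than anything prescribed by the slides.

```python
import numpy as np

def k_means(points, k, n_iters=100, seed=0):
    """Lloyd's algorithm: assign points to the nearest center, recompute centroids.

    points: (n, d) NumPy array; k: number of clusters requested by the user.
    """
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]  # random guess
    labels = np.zeros(len(points), dtype=int)
    for _ in range(n_iters):
        # Each data point finds its closest center.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Each cluster finds the new centroid of its points (keep empty clusters fixed).
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):  # repeat until the centers stop moving
            break
        centers = new_centers
    return centers, labels
```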

  46. K Means Issues • Computationally efficient • Initialization • Termination condition • Distance measure • What should k be?

  47. Hierarchical Clustering • Each point is its own cluster

  48. Hierarchical Clustering • Each point is its own cluster • Find most similar pair of clusters

  49. Hierarchical Clustering • Each point is its own cluster • Find most similar pair of clusters • Merge it into a parent cluster

  50. Hierarchical Clustering • Each point is its own cluster • Find most similar pair of clusters • Merge it into a parent cluster • Repeat
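
A naive agglomerative sketch of those four steps; using single-link distance (closest pair of points) as the similarity between clusters is one choice among several, and the function name is illustrative.

```python
import numpy as np

def agglomerative(points, target_clusters=1):
    """Bottom-up clustering: start with singletons, repeatedly merge the closest pair.

    points: (n, d) NumPy array. Returns the remaining clusters (as index lists)
    and the sequence of merges performed.
    """
    clusters = [[i] for i in range(len(points))]  # each point is its own cluster
    merges = []
    while len(clusters) > target_clusters:
        # Find the most similar (single-link closest) pair of clusters.
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(np.linalg.norm(points[i] - points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        # Merge the pair into a parent cluster and repeat.
        merges.append((clusters[a], clusters[b], d))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters, merges
```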
