Data mining, also known as Knowledge Discovery in Databases (KDD), involves extracting valuable information from vast amounts of data. The rise of data generation from sources like electronic transactions and remote sensors has sparked the need for effective data analysis techniques. Applications range from market analysis and fraud detection to customer profiling. Key methods like association rule mining reveal correlations among data points. This comprehensive overview covers the necessity of data mining, its applications in various industries, and essential techniques used in the process.
Data Mining (DM)/ Knowledge Discovery in Databases (KDD) “The nontrivial extraction of implicit, previously unknown, and potentially useful information from data” [Frawley et al, 1992]
Need for Data Mining • Increased ability to generate data • Remote sensors and satellites • Bar codes for commercial products • Computerization of businesses
Need for Data Mining • Increased ability to store data • Media: bigger magnetic disks, CD-ROMs • Better database management systems • Data warehousing technology
Need for Data Mining • Examples • Wal-Mart records 20,000,000 transactions/day • Healthcare transactions yield multi-GB databases • Mobil Oil exploration storing 100 terabytes • Human Genome Project, multi-GBs and increasing • Astronomical object catalogs, terabytes of images • NASA EOS, 1 terabyte/day
Something for Everyone • Bell Atlantic • MCI • Land’s End • Visa • Bank of New York • FedEx
Market Analysis and Management • Customer profiling • Data mining can tell you what types of customers buy what products (clustering or classification) or what products are often bought together (association rules). • Identifying customer requirements • Discover relationship between personal characteristics and probability of purchase • Discover correlations between purchases
Fraud Detection and Management • Applications: • Widely used in health care, retail, credit card services, telecommunications, etc. • Approach: • Use historical data to build models of fraudulent behavior and use data mining to help identify similar instances. • Examples: • Auto Insurance • Money Laundering • Medical Insurance
(Diagram: data mining draws on statistics, database systems, AI, and advances in hardware)
Mining Association Rules • Association rule mining: • Finding associations or correlations among a set of items or objects in transaction databases, relational databases, and data warehouses. • Applications: • Basket data analysis, cross-marketing, catalog design, loss-leader analysis, clustering, etc. • Examples: • Rule form: “Body → Head [support, confidence]”. • Buys=Diapers → Buys=Beer [0.5%, 60%] • Major=CS ∧ Class=DataMining → Grade=A [1%, 75%]
Rule Measures: Support and Confidence • (Venn diagram: customers who buy diapers, customers who buy beer, and customers who buy both) • Find all the rules X ∧ Y ⇒ Z with minimum confidence and support • support, s: probability that a transaction contains {X, Y, Z} • confidence, c: conditional probability that a transaction having {X, Y} also contains Z • For minimum support 50%, minimum confidence 50%: • A ⇒ C (50%, 66.6%) • C ⇒ A (50%, 100%)
Association Rule • Given • Set of items I = {i1, i2, .., im} • Set of transactions D • Each transaction T in D is a set of items • An association rule is an implication X ⇒ Y, where X and Y are itemsets • Rule meets minimum confidence c (c% of the transactions in D that contain X also contain Y): c = |X ∪ Y| / |X|, where |Z| denotes the number of transactions containing Z • A minimum support s is also met: s = |X ∪ Y| / |D|
Mining Strong Association Rules in Transaction DBs • Measurement of rule strength in a transaction DB: A → B [support, confidence] • support = Prob(A ∪ B) = (# of transactions containing all the items in A ∪ B) / (total # of transactions) • confidence = Prob(B | A) = (# of transactions that contain both A and B) / (# of transactions containing A) • We are often interested in only strong associations, i.e. support ≥ min_sup and confidence ≥ min_conf. • Examples: milk → bread [5%, 60%]; tire ∧ auto_accessories → auto_services [2%, 80%]
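To make the two measures concrete, here is a minimal Python sketch (the transactions and item names are hypothetical, not taken from the slides) that computes support and confidence directly from a transaction list:

```python
# Sketch: computing support and confidence for a candidate rule A -> B
# over a small, hypothetical list of transactions.
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "beer"},
    {"milk", "beer"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(body, head, transactions):
    """P(head | body) = support(body ∪ head) / support(body)."""
    return support(body | head, transactions) / support(body, transactions)

print(support({"milk", "bread"}, transactions))       # 0.5
print(confidence({"milk"}, {"bread"}, transactions))   # ≈ 0.667
```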
Methods for Mining Associations • Apriori • Partition Technique: • Sampling technique • Anti-Skew • Multi-level or generalized association • Constraint-based or query-based association
Apriori (Levelwise) • Scan database multiple times • For the i-th scan, find all large itemsets of size i with min support • Use the large itemsets from scan i as input to scan i+1 • Create candidates: itemsets of size i+1 all of whose subsets are large itemsets • Notation: Lk = set of large k-itemsets; Ck = set of candidate large itemsets of size k • Note: If {A,B} is not a large itemset, then no superset of it can be either. (A sketch of the algorithm follows below.)
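A compact sketch of the levelwise idea, assuming transactions are Python sets and itemsets are frozensets; the function and variable names are illustrative, not part of the original slides:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Levelwise mining: large itemsets from scan k seed the candidates for scan k+1."""
    n = len(transactions)
    items = {i for t in transactions for i in t}
    current = {frozenset([i]) for i in items}          # C1: all single items
    frequent = {}                                      # all large itemsets -> support
    k = 1
    while current:
        # One scan: count how many transactions contain each candidate
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        frequent.update(level)
        # Candidate generation for the next level, with Apriori pruning:
        # keep only (k+1)-itemsets whose k-subsets are all large
        candidates = set()
        for a, b in combinations(level, 2):
            union = a | b
            if len(union) == k + 1 and all(frozenset(s) in level
                                           for s in combinations(union, k)):
                candidates.add(union)
        current = candidates
        k += 1
    return frequent

# Hypothetical usage:
# apriori([{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}], 0.5)
```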
Mining Association Rules -- Example • Min. support 50%, Min. confidence 50% • For rule A ⇒ C: support = support({A, C}) = 50%; confidence = support({A, C}) / support({A}) = 66.6% • Apriori principle: Any subset of a frequent itemset must be frequent.
Apriori example: Minsup = 0.25, Minconf = 0.5
L1 = {(A, 3), (B, 2), (C, 2), (D, 1), (E, 1), (F, 1)}
C2 = {(A,B), (A,C), (A,D), (A,E), (A,F), (B,C), .., (E,F)}
L2 = {(A,B, 1), (A,C, 2), (A,D, 1), (B,C, 1), (B,E, 1), (B,F, 1), (E,F, 1)}
C3 = {(A,B,C), (A,B,D), (A,C,D), (B,C,E), (B,C,F), (B,E,F)}
L3 = {(A,B,C, 1), (B,E,F, 1)}
C4 = {}, L4 = {}, end of algorithm
Possible rules (c = confidence, s = support count):
A⇒B (c=.33, s=1), B⇒A (c=.5, s=1), A⇒C (c=.67, s=2), C⇒A (c=1.0, s=2), A⇒D (c=.33, s=1), D⇒A (c=1.0, s=1), B⇒C (c=.5, s=1), C⇒B (c=.5, s=1), B⇒E (c=.5, s=1), E⇒B (c=1, s=1), B⇒F (c=.5, s=1), F⇒B (c=1, s=1)
A⇒B∧C (c=.33, s=1), B⇒A∧C (c=.5, s=1), C⇒A∧B (c=.5, s=1), A∧B⇒C (c=1, s=1), A∧C⇒B (c=.5, s=1), B∧C⇒A (c=1, s=1), B⇒E∧F (c=.5, s=1), E⇒B∧F (c=1, s=1), F⇒B∧E (c=1, s=1), B∧E⇒F (c=1, s=1), B∧F⇒E (c=1, s=1), E∧F⇒B (c=1, s=1)
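The rule list above can be produced mechanically from the frequent itemsets. A hedged sketch, assuming `frequent` maps itemsets to their supports as returned by the `apriori` sketch earlier:

```python
from itertools import combinations

def generate_rules(frequent, min_conf):
    """Split each frequent itemset into body -> head; confidence is
    support(body ∪ head) / support(body). Every body is itself frequent
    (Apriori principle), so its support is available in `frequent`."""
    rules = []
    for itemset, supp in frequent.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for body in map(frozenset, combinations(itemset, r)):
                head = itemset - body
                conf = supp / frequent[body]
                if conf >= min_conf:
                    rules.append((sorted(body), sorted(head), supp, conf))
    return rules
```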
Partitioning • Requires only two passes through the external database • Divide database into n partitions, each of which fits in main memory • Scan 1: Process one partition in memory at a time, finding local large itemsets • Candidate large itemsets are the union of all local large itemsets (a superset of the actual large itemsets; may contain false positives) • Scan 2: Calculate support, determine actual large itemsets • If data is skewed, partitioning may not work well: the chance that a local large itemset is a global large itemset may be small.
Partitioning • Will any large itemsets be missed? • No: if l ∉ Li for every partition i, then t1(l)/t1 < MS and t2(l)/t2 < MS and … and tn(l)/tn < MS, thus t1(l) + t2(l) + … + tn(l) < MS × (t1 + t2 + … + tn), so l cannot meet the global minimum support. (A sketch of the two-pass scheme follows below.)
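A minimal sketch of the two-pass scheme, assuming `local_miner` is any in-memory frequent-itemset routine (for example the `apriori` sketch above) that returns itemsets meeting the local support threshold:

```python
def partition_mining(transactions, min_support, n_partitions, local_miner):
    """Two passes: mine each partition locally, then verify the union of local
    results against the full database, dropping the false positives."""
    size = len(transactions)
    step = -(-size // n_partitions)                  # ceiling division
    partitions = [transactions[i:i + step] for i in range(0, size, step)]

    # Pass 1: local large itemsets, one in-memory partition at a time
    candidates = set()
    for part in partitions:
        candidates |= set(local_miner(part, min_support))

    # Pass 2: one full scan computes global support for every candidate
    counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
    return {c: cnt / size for c, cnt in counts.items() if cnt / size >= min_support}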
Clementine (UK, bought by SPSS) The Web Node shows the strength of associations in the data - i.e. how often field values coincide
Multi-Level Association • A descendant of an infrequent itemset cannot be frequent • A transaction database can be encoded by dimensions and levels • (Taxonomy figure: food → bread, milk; milk → 2%, skim; bread → white, wheat; brand level: Fraser, Sunset)
Encoding Hierarchical Information in Transaction Database • A taxonomy for the relevant data items • Conversion of bar_code into generalized_item_id • (Taxonomy figure: food → milk, bread, …; milk → 2%, chocolate, …; brand level: Old Mills, Wonder, Dairyland, Foremost)
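A small sketch of the conversion step, using a hypothetical taxonomy table; the brand and category names are illustrative only:

```python
# Hypothetical taxonomy: each concrete item maps to its ancestors,
# from most specific (level 0) up toward the top of the hierarchy.
taxonomy = {
    "Dairyland 2% milk":  ["2% milk", "milk", "food"],
    "Foremost skim milk": ["skim milk", "milk", "food"],
    "Wonder white bread": ["white bread", "bread", "food"],
}

def generalize(transaction, level):
    """Replace each item by its ancestor at the given taxonomy level
    (items without a taxonomy entry are kept as-is)."""
    return {taxonomy[item][level] if item in taxonomy else item
            for item in transaction}

print(generalize({"Dairyland 2% milk", "Wonder white bread"}, 1))
# -> {'milk', 'bread'}: rules can now be mined at the category level
```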
Mining Surprising Temporal Patterns [Chakrabarti et al] • (Timeline, 1990: “Milk and cereal sell together!”) • Find prevalent rules that hold over large fractions of data • Useful for promotions and store arrangement • Intensively researched
Prevalent != Interesting • (Timeline, 1995: “Milk and cereal sell together!”; 1998: “Milk and cereal sell together!” Zzzz...) • Analysts already know about prevalent rules • Interesting rules are those that deviate from prior expectation • Mining’s payoff is in finding surprising phenomena
Association Rules - Strengths & Weaknesses • Strengths • Understandable and easy to use • Useful • Weaknesses • Brute force methods can be expensive (memory and time) • Apriori is O(CD), where C = sum of sizes of candidates (2^n possible, n = # of items) and D = size of database • Association does not necessarily imply correlation • Validation? • Maintenance?
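One common way to check the "association does not imply correlation" point is lift, which compares a rule's confidence against what independence would predict. A small sketch with hypothetical numbers:

```python
def lift(support_ab, support_a, support_b):
    """Lift = P(A and B) / (P(A) * P(B)); a value near 1 means A and B are
    roughly independent even if the rule A -> B has high confidence."""
    return support_ab / (support_a * support_b)

# Hypothetical numbers: the rule A -> B has 75% confidence (0.6 / 0.8),
# yet lift = 0.6 / (0.8 * 0.75) ≈ 1.0, i.e. no correlation beyond chance.
print(lift(0.6, 0.8, 0.75))
```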
Clustering • Group similar items together • Example: sorting laundry • Similar items may have important attributes / functionality in common • Group customers together with similar interests and spending patterns • Form of unsupervised learning • Cluster objects into classes using rule: • Maximize intraclass similarity, minimize interclass similarity
Clustering Techniques • Partition • Enumerate all partitions • Score by some criteria • K means • Hierarchical • Model based • Hypothesize model for each cluster • Find model that best fits data • AutoClass, Cobweb
Clustering Goal • Suppose you transmit coordinates of points drawn randomly from this dataset • Only allowed 2 bits/point • What encoder/decoder will lose least information?
Idea One • Break the space into a 2×2 grid (cells labeled 00, 01, 10, 11) • Decode each bit-pair as the middle of its grid cell
Idea Two • Break the space into a 2×2 grid (cells labeled 00, 01, 10, 11) • Decode each bit-pair as the centroid of all data in its grid cell
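A 1-D analogue of the two ideas (the data and grid here are hypothetical): decoding each 2-bit code as the centroid of the points in its cell typically loses less information than decoding it as the cell midpoint:

```python
import random

# Sketch: 2-bit encoding of 1-D points, comparing "grid midpoint" decoding
# (Idea One) with "centroid of the points in each cell" decoding (Idea Two).
random.seed(0)
points = [random.gauss(0.2, 0.05) for _ in range(500)] + \
         [random.gauss(0.8, 0.05) for _ in range(500)]

def cell(x):                       # 2 bits -> 4 equal-width cells on [0, 1)
    return min(3, max(0, int(x * 4)))

midpoints = [0.125, 0.375, 0.625, 0.875]
cells = [[] for _ in range(4)]
for x in points:
    cells[cell(x)].append(x)
centroids = [sum(c) / len(c) if c else m for c, m in zip(cells, midpoints)]

def mse(decoder):
    """Mean squared reconstruction error for a given cell -> value decoder."""
    return sum((x - decoder[cell(x)]) ** 2 for x in points) / len(points)

print("grid-midpoint MSE:", mse(midpoints))
print("centroid MSE:     ", mse(centroids))   # typically lower: less information lost
```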
K Means Clustering • Ask user how many clusters (e.g., k=5) • Randomly guess k cluster center locations • Each data point finds its closest center • Each cluster finds the new centroid of its points • Repeat until the assignments stop changing (see the sketch below)
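A minimal sketch of the loop just described (Lloyd's algorithm), assuming points are numeric tuples; initialization and the empty-cluster case are handled naively:

```python
import random

def dist2(p, q):
    """Squared Euclidean distance between two points (tuples)."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, max_iters=100):
    """Guess k centers, assign each point to its closest center,
    recompute centroids, repeat until the assignment stops changing."""
    centers = random.sample(points, k)                 # random initial guess
    assignment = None
    for _ in range(max_iters):
        new_assignment = [min(range(k), key=lambda j: dist2(p, centers[j]))
                          for p in points]
        if new_assignment == assignment:               # converged
            break
        assignment = new_assignment
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:                                # leave empty clusters alone
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, assignment
```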
K Means Issues • Computationally efficient • Initialization • Termination condition • Distance measure • What should k be?
Hierarchical Clustering • Each point starts as its own cluster • Find the most similar pair of clusters • Merge them into a parent cluster • Repeat until the desired number of clusters remains (see the sketch below)
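A minimal sketch of the bottom-up loop, using single-link distance between clusters (one of several possible similarity choices) and stopping at a requested number of clusters:

```python
def dist2(p, q):
    """Squared Euclidean distance between two points (tuples)."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def agglomerative(points, target_k=1):
    """Bottom-up clustering: every point starts as its own cluster, and the
    most similar pair (single-link distance) is merged until target_k remain."""
    clusters = [[p] for p in points]
    while len(clusters) > target_k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist2(p, q) for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]        # merge into a parent cluster
        del clusters[j]
    return clusters
```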