
Advanced Topics in Data Mining



  1. Advanced Topics in Data Mining • Association Rules • Sequential Patterns • Web Mining

  2. Where to Find References? • Proceedings • Proceedings of ACM SIGKDD International Conference on Knowledge Discovery and Data Mining • Proceedings of IEEE International Conference on Data Mining (ICDM) • Proceedings of IEEE International Conference on Data Engineering (ICDE) • Proceedings of the International Conference on Very Large Data Bases (VLDB) • ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery • Proceedings of the International Conference on Data Warehousing and Knowledge Discovery

  3. Where to Find References? • Proceedings • Proceedings of ACM SIGMOD International Conference on Management of Data • Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD) • European Conference on Principles of Data Mining and Knowledge Discovery (PKDD) • Proceedings of the International Conference on Database Systems for Advanced Applications (DASFAA) • Proceedings of the International Conference on Database and Expert Systems Applications (DEXA)

  4. Where to Find References? • Journals • IEEE Transactions on Knowledge and Data Engineering (TKDE) • Data Mining and Knowledge Discovery • Journal of Intelligent Information Systems • ACM SIGMOD Record • The VLDB Journal (The International Journal on Very Large Data Bases) • Knowledge and Information Systems • Data & Knowledge Engineering • International Journal of Cooperative Information Systems

  5. Advanced Topics in Data Mining: Association Rules

  6. Association Analysis

  7. What Is Association Mining? • Association Rule Mining • Finding frequent patterns, associations, correlations, or causal structures among itemsets in transaction databases, relational databases, and other information repositories • Applications • Market basket analysis (marketing strategy: items to put on sale at reduced prices), cross-marketing, catalog design, shelf-space layout design, etc. • Examples • Rule form: Body → Head [Support, Confidence] • buys(x, “Computer”) → buys(x, “Software”) [2%, 60%] • major(x, “CS”) ∧ takes(x, “DB”) → grade(x, “A”) [1%, 75%]

  8. Market Basket Analysis • Typically, association rules are considered interesting if they satisfy both a minimum support threshold and a minimum confidence threshold.

  9. Rule Measures: Support and Confidence • Let minimum support = 50% and minimum confidence = 50%; then we have • A → C [50%, 66.6%] • C → A [50%, 100%]

  10. Support & Confidence

  11. Association Rule: Basic Concepts • Given: • (1) a database of transactions • (2) each transaction is a list of items (purchased by a customer in a visit) • Find all rules that correlate the presence of one set of items with that of another set of items • Find all rules A → B with minimum confidence and support • support, s = P(A ∪ B) • confidence, c = P(B|A)
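
To make the two measures concrete, here is a minimal Python sketch. The four-transaction database is an assumption chosen so the numbers reproduce the A → C example from slide 9 (the slide's actual table is not part of this transcript):

```python
# Toy transaction database (assumed; picked to match the slide's numbers).
transactions = [
    {"A", "B", "C"},
    {"A", "C"},
    {"A", "D"},
    {"B", "E", "F"},
]

def support(itemset):
    """s = P(itemset): fraction of transactions containing every item."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """c = P(consequent | antecedent) = support(A ∪ B) / support(A)."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)

print(support({"A", "C"}))        # 0.5    -> 50% support for A -> C and C -> A
print(confidence({"A"}, {"C"}))   # 0.666  -> A -> C [50%, 66.6%]
print(confidence({"C"}, {"A"}))   # 1.0    -> C -> A [50%, 100%]
```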

  12. Association Rule Mining: A Road Map • Boolean vs. quantitative associations (based on the types of values handled in the rule set) • buys(x, “SQLServer”) ∧ buys(x, “DM Book”) → buys(x, “DBMiner”) [0.2%, 60%] • age(x, “30..39”) ∧ income(x, “42..48K”) → buys(x, “PC”) [1%, 75%] • Single-dimensional vs. multi-dimensional associations • Single-level vs. multiple-level analysis (based on the levels of abstraction involved in the rule set)

  13. Terminologies • Item • I1, I2, I3, … • A, B, C, … • Itemset • {I1}, {I1, I7}, {I2, I3, I5}, … • {A}, {A, G}, {B, C, E}, … • 1-Itemset • {I1}, {I2}, {A}, … • 2-Itemset • {I1, I7}, {I3, I5}, {A, G}, …

  14. Terminologies • K-Itemset • An itemset containing K items • Frequent K-Itemset • A K-itemset whose support meets a minimum support threshold • Association Rule • A rule that satisfies both a minimum support threshold and a minimum confidence threshold

  15. Analysis • The number of itemsets grows exponentially with the number of items: n items yield C(n, k) itemsets of cardinality k and 2^n − 1 non-empty itemsets in total (a quick check follows below)
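
A quick check of that count (n = 20 is an arbitrary choice for illustration):

```python
from math import comb

n = 20
print(sum(comb(n, k) for k in range(1, n + 1)))  # 1048575 == 2**20 - 1
print(comb(n, 10))   # 184756 itemsets of cardinality 10 alone
```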

  16. Mining Association Rules: Apriori Principle • Min. support 50%, min. confidence 50% • For rule A → C: • support = support({A, C}) = 50% • confidence = support({A, C}) / support({A}) = 66.6% • The Apriori principle: • Any subset of a frequent itemset must be frequent

  17. Mining Frequent Itemsets: the Key Step • Find the frequent itemsets: the sets of items that have minimum support • Every subset of a frequent itemset must also be frequent • i.e., if {A, B} is a frequent itemset, both {A} and {B} must be frequent itemsets • Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets), as sketched below • Use the frequent itemsets to generate association rules
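
A minimal sketch of this level-wise loop, assuming min_sup is an absolute count and using a simple union-based join (the classic implementation joins itemsets sharing a common (k−1)-prefix); all names here are my own:

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Level-wise search: L1 -> C2 -> L2 -> C3 -> ..."""
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}
    freq = {}
    level = [frozenset([i]) for i in items
             if sum(i in t for t in transactions) >= min_sup]
    k = 1
    while level:
        # one scan: count every candidate at this level
        for c in level:
            freq[c] = sum(c <= t for t in transactions)
        # join Lk with itself to form C(k+1) ...
        candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
        # ... and prune candidates with any infrequent k-subset (Apriori principle)
        level = [c for c in candidates
                 if all(freq.get(frozenset(s), 0) >= min_sup
                        for s in combinations(c, k))]
        k += 1
    return {s: n for s, n in freq.items() if n >= min_sup}

D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
print(apriori(D, min_sup=2))   # includes frozenset({2, 3, 5}): 2
```

The same toy database D reappears in the worked trace on slide 23 below, so the output can be checked against it.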

  18. Example

  19. Apriori Algorithm

  20. Apriori Algorithm

  21. Apriori Algorithm

  22. Example of Generating Candidates • L3={abc, abd, acd, ace, bcd} • Self-joining: L3*L3 • abcd from abc and abd • acde from acd and ace • Pruning: • acde is removed because ade is not in L3 • C4={abcd}
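
A sketch of exactly this join-and-prune step (itemsets encoded as sorted tuples; function names are my own):

```python
from itertools import combinations

L3 = [("a","b","c"), ("a","b","d"), ("a","c","d"), ("a","c","e"), ("b","c","d")]

def self_join(Lk):
    """Join itemsets that agree on their first k-1 items."""
    out = set()
    for a in Lk:
        for b in Lk:
            if a[:-1] == b[:-1] and a[-1] < b[-1]:
                out.add(a + (b[-1],))
    return out

def prune(candidates, Lk):
    """Drop candidates having an infrequent k-subset (Apriori principle)."""
    Lk = set(Lk)
    return {c for c in candidates
            if all(s in Lk for s in combinations(c, len(c) - 1))}

C4 = self_join(L3)      # {('a','b','c','d'), ('a','c','d','e')}
print(prune(C4, L3))    # {('a','b','c','d')} - acde pruned: ade is not in L3
```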

  23. Another Example 1 • Database D (four transactions, min. support count = 2): {1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5} • Scan D to count C1: {1}:2, {2}:3, {3}:3, {4}:1, {5}:3 → generate L1 = {1}, {2}, {3}, {5} • Generate C2 = {1,2}, {1,3}, {1,5}, {2,3}, {2,5}, {3,5}; scan D to count C2: {1,2}:1, {1,3}:2, {1,5}:1, {2,3}:2, {2,5}:3, {3,5}:2 → generate L2 = {1,3}, {2,3}, {2,5}, {3,5} • Generate C3 = {2,3,5}; scan D to count C3: {2,3,5}:2 → generate L3 = {2,3,5}

  24. Another Example 2

  25. Is Apriori Fast Enough? — Performance Bottlenecks • The core of the Apriori algorithm: • Use frequent (k−1)-itemsets to generate candidate frequent k-itemsets • Use database scans to collect counts for the candidate itemsets • The bottlenecks of Apriori: • Huge candidate sets: • 10^4 frequent 1-itemsets will generate roughly 10^7 candidate 2-itemsets • To discover a frequent pattern of size 100, e.g., {a1, a2, …, a100}, one needs to generate 2^100 ≈ 10^30 candidates • Multiple scans of the database: • Needs (n + 1) scans, where n is the length of the longest pattern

  26. Demo: IBM Intelligent Miner

  27. Demo Database

  28. Methods to Improve Apriori’s Efficiency • Hash-based itemset counting: a k-itemset whose corresponding hash-bucket count is below the threshold cannot be frequent (see the sketch below) • Transaction reduction: a transaction that does not contain any frequent k-itemset is useless in subsequent scans • Partitioning: any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB • Sampling: mine a subset of the given data with a lower support threshold, plus a method to verify completeness
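
A minimal sketch of the first idea, hash-based counting (DHP-style). The bucket count and the use of Python's built-in hash are illustrative assumptions; the published DHP algorithm uses a specific order-based hash function:

```python
from collections import Counter
from itertools import combinations

def bucket_filter(transactions, min_sup, n_buckets=7):
    """While scanning for 1-itemsets, also hash every 2-itemset of every
    transaction into a small table of bucket counts."""
    buckets = Counter()
    for t in transactions:
        for pair in combinations(sorted(t), 2):
            buckets[hash(pair) % n_buckets] += 1
    # A pair can only be frequent if its whole bucket reaches min_sup,
    # so low-count buckets prune candidate 2-itemsets before generation.
    def may_be_frequent(pair):
        return buckets[hash(tuple(sorted(pair))) % n_buckets] >= min_sup
    return may_be_frequent

D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
check = bucket_filter(D, min_sup=2)
print(check((1, 4)))   # False unless a hash collision inflates its bucket
```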

  29. Partitioning

  30. Hash-Based Itemset Counting

  31. Compare Apriori & DHP (Direct Hash & Pruning): Apriori

  32. Compare Apriori & DHP: DHP

  33. DHP: Database Trimming

  34. DHP (Direct Hash & Pruning) • A database has four transactions • Let min_sup = 50%

  35. Example: Apriori

  36. Example: DHP

  37. Example: DHP

  38. Example: DHP

  39. Mining Frequent Patterns Without Candidate Generation • Compress a large database into a compact Frequent-Pattern tree (FP-tree) structure • highly condensed, but complete for frequent pattern mining • avoids costly repeated database scans • Develop an efficient, FP-tree-based frequent pattern mining method • A divide-and-conquer methodology: decompose mining tasks into smaller ones • Avoid candidate generation: sub-database tests only!

  40. Construct FP-tree from a Transaction DB

  41. Construction Steps • Scan the DB once to find the frequent 1-itemsets (single-item patterns) • Order the frequent items in frequency-descending order • Sort each transaction according to this descending order • Scan the DB again and construct the FP-tree (a sketch follows)
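
The two scans can be sketched compactly in Python. This is a minimal sketch under simplifying assumptions: nodes are nested dicts, and the header table with node-links that a real FP-tree maintains is omitted. The nine-transaction database in the usage example is the classic one that slides 43 onward appear to use:

```python
from collections import Counter

def build_fp_tree(transactions, min_sup):
    # Scan 1: count single items, keep the frequent ones
    counts = Counter(i for t in transactions for i in t)
    frequent = {i for i, n in counts.items() if n >= min_sup}
    # Frequency-descending order (ties broken by item name)
    rank = {i: r for r, i in enumerate(
        sorted(frequent, key=lambda i: (-counts[i], i)))}
    # Scan 2: insert each transaction's sorted frequent items as a path;
    # each node is item -> [count, children]
    root = {}
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in frequent),
                           key=rank.__getitem__):
            entry = node.setdefault(item, [0, {}])
            entry[0] += 1          # shared prefixes only bump the count
            node = entry[1]
    return root

D = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"}, {"I1", "I2", "I4"},
     {"I1", "I3"}, {"I2", "I3"}, {"I1", "I3"}, {"I1", "I2", "I3", "I5"},
     {"I1", "I2", "I3"}]
tree = build_fp_tree(D, min_sup=2)
print(tree["I2"][0])   # 7 -- seven transactions share the I2-rooted path
```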

  42. Benefits of the FP-Tree Structure • Completeness • never breaks a long pattern of any transaction • preserves complete information for frequent pattern mining • Compactness • reduces irrelevant information: infrequent items are gone • frequency-descending ordering: more frequent items are more likely to be shared • never larger than the original database (not counting node-links and count fields) • Compression ratio can be over 100

  43. Frequent Pattern Growth • Order frequent items in frequency-descending order • Patterns found for I5: {I1, I5}, {I2, I5}, {I2, I1, I5} • For I4: {I2, I4} • For I3: {I1, I3}, {I2, I3}, {I2, I1, I3} • For I1: {I2, I1}

  44. Frequent Pattern Growth • Trim the database into sub-DBs: from the FP-tree, each suffix item yields a conditional pattern base, and a conditional FP-tree is built from each conditional pattern base • For I5: {I1, I5}, {I2, I5}, {I2, I1, I5} • For I4: {I2, I4} • For I3: {I1, I3}, {I2, I3}, {I2, I1, I3} • For I1: {I2, I1}

  45. Conditional FP-tree • The conditional FP-tree built from the conditional pattern base for I3

  46. Mining Results Using the FP-tree • For I5 (the NULL root generates no pattern) • Conditional pattern base: {(I2 I1: 1), (I2 I1 I3: 1)} • Conditional FP-tree: NULL: 2 → I2: 2 → I1: 2 • Generate frequent itemsets: I2:2 gives rule I2 I5: 2; I1:2 gives rule I1 I5: 2; I2 I1:2 gives rule I2 I1 I5: 2

  47. Mining Results Using the FP-tree • For I4 • Conditional pattern base: {(I2 I1: 1), (I2: 1)} • Conditional FP-tree: NULL: 2 → I2: 2 • Generate frequent itemset: I2:2 gives rule I2 I4: 2
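
The two results above can be reproduced with a small sketch: count items in the conditional pattern base, keep those that reach min_sup (these form the single-path conditional FP-tree), then combine the surviving path items with the suffix. Function and variable names are my own, and the single-path shortcut is a simplification that happens to be exact for these examples:

```python
from collections import Counter
from itertools import combinations

def mine_single_path(cond_pattern_base, suffix, min_sup=2):
    """cond_pattern_base: list of (prefix_items, count) pairs for one suffix."""
    counts = Counter()
    for prefix, n in cond_pattern_base:
        for item in prefix:
            counts[item] += n
    # items surviving min_sup form the (single-path) conditional FP-tree
    path = [i for i in counts if counts[i] >= min_sup]
    # every non-empty combination of path items, joined with the suffix,
    # is frequent; on a single path its count is the minimum member count
    patterns = {}
    for k in range(1, len(path) + 1):
        for combo in combinations(path, k):
            patterns[frozenset(combo) | {suffix}] = min(counts[i] for i in combo)
    return patterns

# For I5 (matches slide 46): {I2,I5}:2, {I1,I5}:2, {I2,I1,I5}:2
print(mine_single_path([(("I2", "I1"), 1), (("I2", "I1", "I3"), 1)], "I5"))
# For I4 (matches slide 47): {I2,I4}:2
print(mine_single_path([(("I2", "I1"), 1), (("I2",), 1)], "I4"))
```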
