Market Basket Analysis and Advanced Data Mining


Presentation Transcript


    1. Market Basket Analysis and Advanced Data Mining Professor Amit Basu abasu@smu.edu

    3. Examples Rule form: LHS → RHS. IF a customer buys diapers, THEN they also buy beer: diapers → beer. “Transactions that purchase bread and butter also purchase milk”: bread ∧ butter → milk. Customers who purchase maintenance agreements are very likely to purchase large appliances. When a new hardware store opens, one of the most commonly sold items is toilet bowl cleaner.

    4. Representations What’s the difference between these patterns? (a) Risk = 0.3 * sin(numcards * dem1^0.25) + 0.83 * (pastdef - dem2) * cos(employed + dem1)^2 (b) Risk = 0.93 * priordefault + 0.23 * num_cards - 1.3 * employed - 0.734 (c) IF person has a good credit rating THEN they have fewer accidents

    5. Evaluation Support: a measure of how often the items in an association occur together, as a percentage of all transactions. Example: in 2% of the purchases at a hardware store, both a pick and a shovel were bought. support = #tuples(LHS, RHS) / N. Confidence: the confidence of the rule “B given A” is a measure of how likely it is that B occurs when A has occurred, with 100% meaning that B always occurs if A has occurred. confidence = #tuples(LHS, RHS) / #tuples(LHS). Example: bread and butter → milk [90%, 1%]. Rules originating from the same itemset have identical support but can have different confidence.
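
The two formulas on this slide can be sketched in a few lines of Python. The transactions below are made-up toy data, and the helper names (`count`, `support`, `confidence`) are illustrative rather than taken from the presentation.

```python
# Toy transactions (hypothetical data, for illustration only).
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter", "milk"},
    {"bread", "butter", "milk", "eggs"},
]

def count(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

def support(lhs, rhs, transactions):
    """support = #tuples(LHS, RHS) / N"""
    return count(lhs | rhs, transactions) / len(transactions)

def confidence(lhs, rhs, transactions):
    """confidence = #tuples(LHS, RHS) / #tuples(LHS).
    Assumes LHS occurs in at least one transaction."""
    return count(lhs | rhs, transactions) / count(lhs, transactions)

# Two rules built from the same itemset {bread, butter, milk}.
s1 = support({"bread", "butter"}, {"milk"}, transactions)
c1 = confidence({"bread", "butter"}, {"milk"}, transactions)
s2 = support({"milk"}, {"bread", "butter"}, transactions)
c2 = confidence({"milk"}, {"bread", "butter"}, transactions)
print(s1, c1)
print(s2, c2)
```

Both rules come from the same itemset, so their support is the same (2 of 5 transactions), while the confidence differs (2/3 versus 1/2) because the LHS counts differ.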

    6. The association rules mining problem Generate all association rules from the given dataset that have support greater than a specified minimum and confidence greater than a specified minimum

    7. Examples Rule form: LHS → RHS [confidence, support]. diapers → beer [60%, 0.5%]. “90% of transactions that purchase bread and butter also purchase milk”: bread and butter → milk [90%, 1%].

    8. Example

    9. How Good is an Association Rule? Are support and confidence enough? Lift (improvement) tells us how much better a rule is at predicting the result than just assuming the result in the first place. Lift = P(LHS ∧ RHS) / (P(LHS) · P(RHS)). When lift > 1, the rule is better at predicting the result than guessing. When lift < 1, the rule does worse than informed guessing, and the negative rule (LHS → NOT RHS) predicts better than guessing.
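
As a rough illustration of the lift formula, the sketch below estimates each probability as a relative frequency over a small, made-up set of transactions. The data and the function names `prob` and `lift` are assumptions for the example only.

```python
# Toy transactions (hypothetical data, for illustration only).
transactions = [
    {"diapers", "beer"},
    {"diapers", "beer"},
    {"diapers", "milk"},
    {"milk"},
    {"milk", "beer"},
]

def prob(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def lift(lhs, rhs, transactions):
    """Lift = P(LHS and RHS) / (P(LHS) * P(RHS))."""
    joint = prob(lhs | rhs, transactions)
    return joint / (prob(lhs, transactions) * prob(rhs, transactions))

lift_beer = lift({"diapers"}, {"beer"}, transactions)
lift_milk = lift({"diapers"}, {"milk"}, transactions)
print(f"diapers -> beer: lift = {lift_beer:.3f}")
print(f"diapers -> milk: lift = {lift_milk:.3f}")
```

In this toy data the first rule has lift above 1 and the second has lift below 1, so for the second one the negative rule would be the more useful predictor.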

    10. Computational Complexity Given d unique items: total number of itemsets = 2^d. Total number of possible association rules: R = 3^d - 2^(d+1) + 1.
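
The growth in the number of rules can be checked by brute force for small d. The sketch below counts every rule with non-empty, disjoint LHS and RHS and compares the count with the standard closed form 3^d - 2^(d+1) + 1; that closed form is a well-known textbook result and the function name `count_rules` is made up for this example.

```python
from itertools import combinations

def count_rules(d):
    """Count rules LHS -> RHS where LHS and RHS are non-empty,
    disjoint subsets of d distinct items."""
    items = range(d)
    total = 0
    for k in range(1, d):                      # size of the LHS
        for lhs in combinations(items, k):
            rest = [i for i in items if i not in lhs]
            for j in range(1, len(rest) + 1):  # size of the RHS
                total += sum(1 for _ in combinations(rest, j))
    return total

for d in range(2, 8):
    closed_form = 3**d - 2**(d + 1) + 1
    print(d, count_rules(d), closed_form)
```

Even at d = 6 there are already 602 candidate rules, which is why exhaustive enumeration is not practical for real product catalogs.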

    11. The Problem of Lots of Data A fast food restaurant could have 100 items on its menu. How many combinations are there with 3 different menu items? 161,700! A supermarket has 10,000 or more unique items: about 50 million 2-item combinations and about 167 billion 3-item combinations. Use of product hierarchies (groupings) helps address this common issue. The number of transactions in a given time period could also be huge (hence expensive to analyze).
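
The combination counts on this slide are binomial coefficients, so they can be checked directly with Python's standard library. The variable names are illustrative.

```python
from math import comb

menu_triples = comb(100, 3)            # 3-item combinations from 100 menu items
supermarket_pairs = comb(10_000, 2)    # 2-item combinations from 10,000 items
supermarket_triples = comb(10_000, 3)  # 3-item combinations from 10,000 items

print(f"{menu_triples:,}")         # 161,700
print(f"{supermarket_pairs:,}")    # 49,995,000
print(f"{supermarket_triples:,}")  # 166,616,670,000
```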

    12. Preparing Data for MBA Determining the scope of the dataset (one or many stores, what period, etc.); converting transaction data to itemsets; generalizing items to an appropriate level (depends on the objective of the model); rolling up rare items to get adequate support.
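
A minimal sketch of the conversion and generalization steps, assuming the raw data arrives as (transaction id, item) rows and that a product hierarchy is available as a simple lookup table. The rows, the `category` mapping, and the function name `to_itemsets` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical line-item rows: (transaction_id, item).
rows = [
    (1, "whole milk 1L"), (1, "rye bread"),
    (2, "skim milk 1L"), (2, "white bread"), (2, "saffron"),
    (3, "whole milk 1L"),
]

# Hypothetical product hierarchy: specific item -> more general category.
category = {
    "whole milk 1L": "milk",
    "skim milk 1L": "milk",
    "rye bread": "bread",
    "white bread": "bread",
}

def to_itemsets(rows, category):
    """Group rows by transaction and roll each item up to its category.
    Items without a category entry are kept as-is."""
    baskets = defaultdict(set)
    for tid, item in rows:
        baskets[tid].add(category.get(item, item))
    return dict(baskets)

baskets = to_itemsets(rows, category)
print(baskets)
```

Rolling "whole milk 1L" and "skim milk 1L" up to "milk" gives the generalized item enough occurrences to reach a useful support level, at the cost of losing brand or size detail.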

    13. Search Approach Two sub-problems in discovering all association rules: (1) Find all sets of items (itemsets) that have transaction support above minimum support. Itemsets that qualify are called large itemsets, and all others small itemsets. (2) Generate, from each large itemset, rules that use items from that itemset. Given a large itemset Y and a subset X of Y, divide the support of Y by the support of X. If the ratio c is at least minconf, then the rule X → (Y - X) is satisfied with confidence factor c.
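
The second sub-problem (rule generation from one large itemset) might look like the sketch below. It assumes the supports of Y and of all its subsets have already been computed and stored in a dictionary; the support values and the function name `rules_from_itemset` are made up for the example.

```python
from itertools import combinations

def rules_from_itemset(Y, support, minconf):
    """For each non-empty proper subset X of Y, emit the rule X -> (Y - X)
    when c = support[Y] / support[X] is at least minconf."""
    Y = frozenset(Y)
    rules = []
    for k in range(1, len(Y)):
        for X in map(frozenset, combinations(sorted(Y), k)):
            c = support[Y] / support[X]
            if c >= minconf:
                rules.append((X, Y - X, c))
    return rules

# Hypothetical, precomputed supports for a large itemset and its subsets.
support = {
    frozenset({"bread"}): 0.8,
    frozenset({"butter"}): 0.8,
    frozenset({"milk"}): 0.8,
    frozenset({"bread", "butter"}): 0.6,
    frozenset({"bread", "milk"}): 0.6,
    frozenset({"butter", "milk"}): 0.6,
    frozenset({"bread", "butter", "milk"}): 0.4,
}

rules = rules_from_itemset({"bread", "butter", "milk"}, support, minconf=0.6)
for lhs, rhs, c in rules:
    print(sorted(lhs), "->", sorted(rhs), f"confidence = {c:.2f}")
```

With these numbers only the rules with a two-item LHS pass, since 0.4 / 0.6 is above the threshold while 0.4 / 0.8 is not.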

    14. Reducing Number of Candidates Apriori principle: if an itemset is large, then all of its subsets must also be large. The support of an itemset never exceeds the support of its subsets.

    15. The Apriori Algorithm Progressively identifies large itemsets of different sizes. Exploits the property that any subset of a large itemset is also a large itemset; equivalently, any superset of a small itemset is also small.
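
A compact, unoptimized sketch of the level-wise idea follows. It is not the presenter's code: the join and prune steps are written in the simplest way that respects the Apriori principle, the threshold is applied as "at least minsup", and the toy transactions are hypothetical.

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Return {itemset: support} for every itemset with support >= minsup."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def sup(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # Level 1: large single items.
    items = {i for t in transactions for i in t}
    level = {}
    for i in items:
        c = frozenset([i])
        s = sup(c)
        if s >= minsup:
            level[c] = s

    large = {}
    k = 1
    while level:
        large.update(level)
        k += 1
        prev = set(level)
        # Join: merge pairs of large (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune: drop candidates that have any small (k-1)-subset.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in prev for sub in combinations(c, k - 1))
        }
        # Count support only for the surviving candidates.
        level = {}
        for c in candidates:
            s = sup(c)
            if s >= minsup:
                level[c] = s
    return large

# Toy transactions (hypothetical data, for illustration only).
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter", "milk"},
    {"bread", "butter", "milk", "eggs"},
]
large = apriori(transactions, minsup=0.6)
for itemset, s in sorted(large.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

Because "eggs" is small at level 1, no candidate containing it is ever generated, which is the pruning effect the slide describes.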

    16. Extending MBA Dissociation rules. Combining transaction data with complementary data: shopper characteristics, store characteristics, seasonal factors. Analyzing patterns over time: patterns that span multiple occasions; need to “sessionize” data; need to recognize shoppers across sessions.
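
One simple way to “sessionize” event data is to start a new session whenever the gap between a shopper's consecutive events exceeds a timeout. The sketch below assumes events arrive as (shopper id, timestamp, item) tuples and uses a 30-minute gap; the event format, the timeout, and the function name `sessionize` are assumptions for illustration, not details from the slides.

```python
from datetime import datetime, timedelta

def sessionize(events, gap=timedelta(minutes=30)):
    """Split each shopper's events into sessions separated by more than `gap`.
    Returns a list of (shopper_id, [items]) in shopper and time order."""
    sessions = []
    last_seen = {}  # shopper_id -> (time of last event, index of open session)
    for shopper, ts, item in sorted(events, key=lambda e: (e[0], e[1])):
        prev = last_seen.get(shopper)
        if prev is None or ts - prev[0] > gap:
            sessions.append((shopper, [item]))   # open a new session
            idx = len(sessions) - 1
        else:
            idx = prev[1]
            sessions[idx][1].append(item)        # extend the open session
        last_seen[shopper] = (ts, idx)
    return sessions

# Hypothetical event log.
events = [
    ("a", datetime(2024, 1, 1, 10, 0), "diapers"),
    ("b", datetime(2024, 1, 1, 10, 5), "bread"),
    ("a", datetime(2024, 1, 1, 10, 10), "beer"),
    ("a", datetime(2024, 1, 1, 15, 0), "milk"),
]
sessions = sessionize(events)
print(sessions)
```

Each resulting session can then be treated as one basket, and keeping the shopper id alongside it is what allows patterns to be followed across sessions.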

    17. Usability of Association Rules

    18. Advanced Data Mining Text mining. Mining non-textual data: image and video data (multimedia); spatial data (GIS); temporal data (time series, behavioral patterns). Web mining: Web usage, Web content.

    19. Mining Image Data Traditional pattern recognition: neural networks, supervised learning. Discovering patterns: unsupervised learning, clustering.

    20. Mining Spatial Data Spatial databases typically use special data structures: extensions of tree-structured indexes (quad trees, R-trees, k-d trees, etc.). Relationships based on spatial descriptors (overlapping, disjoint, contains, etc.). Distance-based clustering. Feature extraction. Association rules, e.g. if location is near a lake, pollution is low.

    21. Web Mining Mining data that is obtained from the Web: Web content mining and Web usage mining.

    22. Web Content Mining Search engines; spiders and crawlers; metacrawlers. A major challenge is the unstructured form of the data: lack of high-level standards, abuse of descriptors (meta-information).

    23. Web Usage Mining Mining Web logs. Data is relatively structured. Data is highly dynamic. Problems with identification and location. The inherently non-linear aspects of Web usage behavior: tracking both forward and backward links. Dynamic personalization.

    24. Issues and Trends Mining across multiple data sources and sets. Online mining: what are the patterns right now? Concerns about privacy and other ethical questions: property, accuracy.
