
Constraint Mining of Frequent Patterns in Long Sequences

Presentation Transcript


  1. Constraint Mining of Frequent Patterns in Long Sequences Presented by Yaron Gonen

  2. Outline • Introduction • Problem definition and motivation • Previous work • The CAMLS Algorithm • Overview • Main contributions • Results • Future Work

  3. Frequent Item-sets: The Market-Basket Model • A set of items, e.g., stuff sold in a supermarket • A set of baskets (later called events or transactions), each of which is a small set of the items, e.g., the things one customer buys on one day.

  4. Support • Support for item-set I = the number of baskets containing all items in I (usually given as a percentage) • Given a support threshold minSup, sets of items that appear in > minSup baskets are called frequent item-sets • Simplest question: find all frequent item-sets
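
As a concrete illustration of the definition above, here is a minimal Python sketch of the support computation; the baskets and items are invented for the example, not taken from the slides.

```python
def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    hits = sum(1 for basket in baskets if itemset <= basket)
    return hits / len(baskets)

baskets = [{"beer", "diapers", "milk"},
           {"beer", "diapers"},
           {"milk", "bread"}]

# {beer, diapers} appears in 2 of 3 baskets: frequent whenever minSup <= 2/3.
print(support({"beer", "diapers"}, baskets))  # 0.666...
```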

  5. Example • Minimum support = 0.6 (2 baskets) • (The items and example baskets appeared as a table on the original slide.)

  6. Application (1) • Items: products at a supermarket • Baskets: set of products a customer bought at one time • Example: many people buy beer and diapers together • Place beer next to diapers to increase both sales • Run a sale on diapers and raise the price of beer

  7. Application (2) (Counter-Intuitive) • Items: species of plants • Baskets: each basket represents an attribute and contains the items (plants) that have that attribute • Frequent sets may indicate similarity between plants

  8. Scale of Problem • Costco sells more than 120k different items, and has 57m members (from Wikipedia) • Botany has identified about 350k extant species of plants

  9. The Naïve Algorithm • Generate all possible itemsets • Check their support • (The slide enumerated the candidates: all 1-itemsets, all 2-itemsets, and so on.)
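
A minimal sketch of this naive enumeration, for illustration only: it materializes every candidate itemset over the item universe and counts support with a full pass over the baskets, which is exponential in the number of items.

```python
from itertools import combinations

def naive_frequent_itemsets(baskets, min_sup):
    """Enumerate every itemset over the item universe; keep count >= min_sup."""
    items = sorted(set().union(*baskets))
    frequent = []
    for k in range(1, len(items) + 1):
        for candidate in combinations(items, k):      # all itemsets of size k
            count = sum(1 for b in baskets if set(candidate) <= b)
            if count >= min_sup:
                frequent.append((candidate, count))
    return frequent

print(naive_frequent_itemsets([{"a", "b"}, {"a"}], min_sup=2))  # [(('a',), 2)]
```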

  10. The Apriori Property • All nonempty subsets of a frequent itemset must also be frequent • Equivalently: if an itemset is infrequent, every superset of it can be crossed off without counting

  11. The Apriori Algorithm • Find frequent 1-itemsets by going over the whole DB to count support • While there are candidates: merge and prune frequent itemsets to generate candidates of the next size (this is where the Apriori property is used) • Go through the whole DB to count each candidate's support; keep those above minSup • The DB is scanned as many times as the length of the largest frequent itemset
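
The loop on the slide can be sketched as a textbook-style Apriori in Python; this is an illustrative reconstruction, not the presenter's implementation.

```python
from itertools import combinations

def apriori(baskets, min_sup):
    """Return all frequent itemsets (as sorted tuples) with count >= min_sup."""
    def keep_frequent(candidates):
        counts = {c: sum(1 for b in baskets if set(c) <= b) for c in candidates}
        return {c for c, n in counts.items() if n >= min_sup}

    items = sorted(set().union(*baskets))
    level = keep_frequent({(i,) for i in items})     # frequent 1-itemsets
    result, k = set(level), 2
    while level:                                     # has candidates?
        # Merge pairs sharing a (k-2)-prefix, then prune with the Apriori
        # property: every (k-1)-subset of a candidate must itself be frequent.
        merged = {a + (b[-1],) for a in level for b in level
                  if a[:-1] == b[:-1] and a[-1] < b[-1]}
        pruned = {c for c in merged
                  if all(sub in level for sub in combinations(c, k - 1))}
        level = keep_frequent(pruned)                # one DB pass per size
        result |= level
        k += 1
    return result

baskets = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}]
print(apriori(baskets, min_sup=2))  # 1-itemsets plus ('a','b') and ('a','c')
```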

  12. Vertical Format • Index on items: each item maps to the list of ids of the transactions containing it • Calculating support is fast: intersect the id-lists • (The slide showed the id-lists as a table.)
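
A minimal sketch of the vertical format with invented data: each item maps to the set of ids of the transactions containing it, and the support of an itemset is the size of the intersection of its items' id-sets.

```python
from functools import reduce

def to_vertical(baskets):
    """Build the item -> set-of-transaction-ids index."""
    index = {}
    for tid, basket in enumerate(baskets):
        for item in basket:
            index.setdefault(item, set()).add(tid)
    return index

def support(itemset, index):
    """Support = size of the intersection of the items' tid-sets."""
    return len(reduce(set.intersection, (index[i] for i in itemset)))

index = to_vertical([{"a", "b"}, {"a", "c"}, {"a", "b", "c"}])
print(support({"a", "b"}, index))  # 2
```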

  13. Frequent Sequences: Taking it to the Next Level • A large set of sequences, each of which is a time-ordered list of events (baskets), e.g., all the stuff a single customer buys over time • (The slide illustrated this with a timeline whose events were 2 weeks and 5 days apart.)

  14. Support • Subsequence: a sequence whose events are each a subset of an event of another sequence, in the same order (but not necessarily consecutive) • Support for subsequence s = the number of sequences containing s (usually given as a percentage) • Given a support threshold minSup, subsequences that appear in > minSup sequences are called frequent subsequences • Simplest question: find all frequent subsequences
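
A minimal sketch of the containment test behind this definition, assuming events are represented as Python sets; greedy left-to-right matching is sufficient for subsequence containment.

```python
def is_subsequence(s, t):
    """True if every event of s matches, in order, a superset event in t."""
    it = iter(t)
    return all(any(event <= other for other in it) for event in s)

# The events of <(ab) d> appear, in order, inside <(abc) (de) f>:
print(is_subsequence([{"a", "b"}, {"d"}],
                     [{"a", "b", "c"}, {"d", "e"}, {"f"}]))  # True
```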

  15. Notations • Items are letters: a, b, … • Events are parenthesized: (ab), (bdf), … except for single-item events • Sequences are surrounded by <…> • Every sequence has an identifier, sid

  16. Example • minSup = 0.5 • (The example sequence database appeared as a table on the original slide.)

  17. Motivation • Customer shopping patterns • Stock market fluctuations • Weblog click-stream analysis • Symptoms of a disease • DNA sequence analysis • Weather forecasting • Machine anti-aging • Many more…

  18. Much Harder than Frequent Item-sets! • 2^(m·n) possible candidates, where m is the number of items and n is the number of transactions in the longest sequence

  19. The Apriori Property • If a sequence is not frequent, then any sequence that contains it cannot be frequent

  20. Constraints • Problems: too many frequent sequences, and most of them are not useful • Solution: remove them • Constraints are a way to define usefulness • The trick: do so while mining

  21. Previous Work • GSP (Srikant and Agrawal, 1996): Apriori-based generate-and-test approach • SPADE (Zaki, 2001): Apriori-based generate-and-test approach; uses equivalence classes for memory optimization and a vertical-format DB • PrefixSpan (Pei, 2004): no candidate generation; uses a DB-projection method

  22. Why a New Algorithm? • A huge set of candidate sequences / projected DBs is generated • Multiple scans of the database are needed • Inefficient for mining long sequential patterns • Domain-specific properties are not exploited • Weak support for constraints

  23. The CAMLS Algorithm • Constraint-based Apriori algorithm for Mining Long Sequences • Designed especially for efficient mining of long sequences • Outperforms SPADE and PrefixSpan on both synthetic and real data

  24. The CAMLS Algorithm Makes a logical distinction between two types of constraints: • Intra-Event: not time related (e.g., mutually exclusive items) • Inter-Event: addresses the temporal aspect of the data (e.g., values that can or cannot appear one after the other)

  25. Event-wise Constraints • An event must/must not contain a specific item • Two items cannot occur at the same time • max_event_length: an event cannot contain more than a fixed number of items

  26. Sequence-wise Constraints • max_sequence_length: a sequence cannot contain more than a fixed number of events • max_gap: the pattern is dismissed if too much time passes between consecutive events
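
To make the two families concrete, here is an illustrative sketch of one constraint of each kind; the limits and the mutually exclusive pair are hypothetical values, not taken from the talk.

```python
def satisfies_event_constraints(event,
                                max_event_length=3,
                                mutually_exclusive=frozenset({"x", "y"})):
    """Event-wise: look at a single event (an itemset) in isolation."""
    return (len(event) <= max_event_length
            and not mutually_exclusive <= event)

def satisfies_sequence_constraints(eids, max_sequence_length=10, max_gap=5):
    """Sequence-wise: look at the timestamps (eids) of a whole occurrence."""
    gaps_ok = all(b - a <= max_gap for a, b in zip(eids, eids[1:]))
    return len(eids) <= max_sequence_length and gaps_ok

print(satisfies_event_constraints({"x", "y"}))     # False: exclusive pair
print(satisfies_sequence_constraints([0, 3, 11]))  # False: gap of 8 > 5
```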

  27. CAMLS Overview • Input + constraints (minSup, maxGap, …) → Event-wise phase → frequent events + occurrence index → Sequence-wise phase → Output

  28. What Do We Get? • The best of both worlds: far fewer candidates are generated, and the support check is fast • Worst case: works like SPADE • Tradeoff: uses a bit more memory (for storing the frequent item-sets)

  29. Event-wise Phase • Input: sequence database and constraints • Output: frequent events + occurrence index • Use Apriori or FP-Growth to find frequent itemsets (both with minor modifications)

  30. Event-wise (example soon!) • L1 = all frequent items • for k = 2; Lk-1 ≠ ∅; k++ do • generateCandidates(Lk-1): if two frequent (k-1)-events have the same prefix, merge them to form a new candidate • Lk = pruneCandidates(): prune, calculate support counts and create the occurrence index • L = L ∪ Lk • end for

  31. Occurrence Index • A compact representation of all occurrences of a sequence • Structure: a list of sids, each associated with a list of eids (the slide drew a sequence pointing to sid1 → eid1, eid2, eid3; sid2 → eid4, eid5; sid3 → eid6, eid7, eid8, eid9) • Example on next slide!
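
A minimal sketch of building such an index for a single event, assuming the sid → eids structure described on the slide; the database layout is an illustrative choice.

```python
from collections import defaultdict

def build_occurrence_index(db, event):
    """Map sid -> sorted list of eids at which `event` occurs."""
    index = defaultdict(list)
    for sid, sequence in db.items():
        for eid, items in sequence:    # a sequence is a list of (eid, itemset)
            if event <= items:
                index[sid].append(eid)
    return dict(index)

db = {1: [(0, {"a"}), (3, {"a", "c"}), (11, {"b"})],
      2: [(0, {"a", "c", "d"}), (5, {"a"})]}
print(build_occurrence_index(db, {"a"}))  # {1: [0, 3], 2: [0, 5]}
```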

  32. Event-wise Example (using Apriori) • minSup = 2 • All frequent items: a:3, b:2, c:3, d:3 • Candidates: (ab), (ac), (ad), (bc), … • Support count: (ac):2, (ad):2, (bd):2, (cd):2 • Candidates: (abc), (abd), (acd), … • Support count: (acd):2 • No more candidates! • (The slide also showed an occurrence index whose eids were 0, 1, 3, 11.)

  33. Sequence-wise Phase • Input: frequent events + occurrence index, constraints • Output: all frequent sequences • Similar to GSP's and SPADE's candidate generation phase, except that the frequent itemsets are used as seeds

  34. Sequence-wise • L1 = all frequent 1-sequences • for k = 2; Lk-1 ≠ ∅; k++ do • generateCandidates(Lk-1) • Lk = pruneAndSupCalc() • L = L ∪ Lk • end for • (Elaboration on the next two slides)

  35. Sequence-wise Candidate Generation • If two frequent k-sequences s' = <s'1 s'2 … s'k> and s'' = <s''1 s''2 … s''k> share a common (k-1)-prefix, i.e., <s'1 s'2 … s'k-1> = <s''1 s''2 … s''k-1>, and s' is a generator, we form the new candidate <s'1 s'2 … s'k s''k>
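
A minimal sketch of this prefix join, illustrative only: it shows event-appending extensions (not item extensions such as <a(ac)>), represents sequences as tuples of frozensets, and skips non-generators, as slide 38 explains.

```python
def generate_candidates(level, generators):
    """Join k-sequences sharing a (k-1)-prefix; only generators are extended."""
    candidates = set()
    for s1 in level:
        if s1 not in generators:    # non-generators may not spawn candidates
            continue
        for s2 in level:
            if s1[:-1] == s2[:-1]:  # common (k-1)-prefix; self-join gives <aa>
                candidates.add(s1 + (s2[-1],))
    return candidates

level = {(frozenset("a"),), (frozenset("b"),)}
print(generate_candidates(level, generators=level))
# the four 2-sequence candidates <aa>, <ab>, <ba>, <bb>
```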

  36. Sequence-wise Pruning • Keep a radix-ordered list of the sequences pruned in the current iteration • Within an iteration, one k-sequence may contain another k-sequence that was already pruned • For each new candidate: check whether it contains a sequence in the pruned list (very fast!), test it for frequency, and add it to the pruned list if needed

  37. Support Calculation • A simple intersection operation between the occurrence indexes of the sequences being joined • Once the new occurrence index is formed, support calculation is trivial (count its sids)
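
A minimal sketch of this join, illustrative only: an occurrence of the candidate survives in a sequence when the appended event occurs after an occurrence of the prefix, and support is simply the number of sids left in the joined index.

```python
def join_indexes(prefix_idx, event_idx):
    """Occurrence index of <prefix, event>: event eids after a prefix eid."""
    joined = {}
    for sid, prefix_eids in prefix_idx.items():
        earliest = min(prefix_eids)
        eids = [e for e in event_idx.get(sid, []) if e > earliest]
        if eids:
            joined[sid] = eids
    return joined

prefix = {1: [0, 3], 2: [0]}   # eids where the prefix's last event occurs
event  = {1: [3, 11], 2: [5]}  # eids where the appended event occurs
joined = join_indexes(prefix, event)
print(joined, "support:", len(joined))  # {1: [3, 11], 2: [5]} support: 2
```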

  38. The maxGap Constraint • maxGap is a special kind of constraint: data dependent, and the Apriori property does not apply to it • The occurrence index enables a fast maxGap check • A frequent sequence that does not satisfy maxGap is flagged as a non-generator. Example: • Assume <ab> is frequent, but the gap between a and b exceeds maxGap • Yet <ac> and <ab> join into <acb>, in which all maxGap constraints (a→c and c→b) hold! • So <ab> is a non-generator but is kept in order not to lose <acb>
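
A sketch of how the occurrence index enables the fast maxGap check, reusing the eids from the earlier example; the helper name is hypothetical.

```python
def join_with_max_gap(prefix_idx, event_idx, max_gap):
    """Like a plain index join, but the new event must land within max_gap."""
    joined = {}
    for sid, prefix_eids in prefix_idx.items():
        eids = [e for e in event_idx.get(sid, [])
                if any(0 < e - p <= max_gap for p in prefix_eids)]
        if eids:
            joined[sid] = eids
    return joined

# With a at eids 0 and 3 and b at eid 11, no gap is within maxGap = 5,
# so <ab> has no valid occurrence here and is flagged as a non-generator.
print(join_with_max_gap({1: [0, 3]}, {1: [11]}, max_gap=5))  # {}
```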

  39. Sequence-wise Example • minSup = 2, maxGap = 5 • (The slide traced the original DB through the event-wise phase into candidate generation.) • <aa> is added to the pruned list • <a(ac)> is a super-sequence of <aa>, therefore it is pruned • <ab> does not pass maxGap, therefore it is not a generator • No more candidates!

  40. Evaluation (1): Machine Anti-Aging • How can sequence mining help? Data collected from a machine is a sequence; discover typical behavior leading to failure; monitor the machine and alert before failure • Domain: light intensity per wavelength (continuous) • Pre-processing: discretization and meta-features (maxDisc, maxWL, isBurned) • Synm stands for a synthetic database simulating the machine's behavior with m meta-features

  41. Evaluation (2) • Real stock data values • Rn stands for stock data (10 different stocks) over n days

  42. CAMLS Compared with PrefixSpan

  43. CAMLS Compared with Spade and PrefixSpan

  44. So, What's CAMLS' Contribution? • The constraint distinction makes implementation easy • Two phases • Handling of the maxGap constraint • The occurrence index data structure • A fast new pruning method

  45. Future Research • Main issue: closed sequences • More constraints (aspiring to regular-expression constraints)

  46. Thank You!
