Approximate Mining of Consensus Sequential Patterns

Hye-Chung (Monica) Kum, University of North Carolina at Chapel Hill, Computer Science Department / School of Social Work. http://www.cs.unc.edu/~kum/approxMAP
Presentation Transcript


  1. Approximate Mining of Consensus Sequential Patterns Hye-Chung (Monica) Kum University of North Carolina, Chapel Hill Computer Science Department School of Social Work http://www.cs.unc.edu/~kum/approxMAP

  2. Knowledge Discovery & Data mining (KDD) • "The nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data"

  7. Knowledge Discovery & Data mining (KDD) • "The nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data" • The goal is to discover and present knowledge in a form that is easily comprehensible to humans, in a timely manner

  9. Knowledge Discovery & Data mining (KDD) • "The nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data" • The goal is to discover and present knowledge in a form that is easily comprehensible to humans, in a timely manner • combining ideas drawn from databases, machine learning, artificial intelligence, knowledge-based systems, information retrieval, statistics, pattern recognition, visualization, and parallel and distributed computing • Fayyad, Piatetsky-Shapiro, Smyth 1996

  10. What is KDD ? • Purpose • Extract useful information • Source • Operational or Administrative Data • Example • VIC card database for buying patterns • monthly welfare service patterns

  11. Example • Analyze buying patterns for sales marketing

  12. Example • VIC card : 4/8 = 50%

  13. Example • VIC card : 5/8 = 63%

  14. Overview • What is KDD (Knowledge Discovery & Data mining) • Problem : Sequential Pattern Mining • Method : ApproxMAP • Evaluation Method • Results • Case Study • Conclusion

  17. Sequential Pattern Mining • Detecting patterns in sequences of sets

  18. Welfare Program Participation Patterns • What are the common participation patterns ? • What are the variations to them ? • How do different policies affect these patterns?

  19. Thesis Statement • The author of this dissertation asserts that multiple alignment is an effective model for uncovering the underlying trend in sequences of sets. • I will show that approxMAP • is a novel method that applies multiple alignment techniques to sequences of sets, • and effectively extracts the underlying trend in the data • by organizing the large database into clusters • and by giving reasonable descriptors (weighted sequences and consensus sequences) for the clusters via multiple alignment. • Furthermore, I will show that approxMAP • is robust to its input parameters, • is robust to noise and outliers in the data, • is scalable with respect to the size of the database, • and, in comparison to the conventional support model, can better recover the underlying patterns with little confounding information under most circumstances. • In addition, I will demonstrate the usefulness of approxMAP using real world data.

  20. Thesis Statement • Multiple alignment is an effective model to uncover the underlying trend in sequences of sets. • ApproxMAP is a novel method to apply multiple alignment techniques to sequences of sets. • ApproxMAP can recover the underlying patterns with little confounding information under most circumstances including those in which the conventional methods fail. • I will demonstrate the usefulness of approxMAP using real world data.

  21. Sequential Pattern Mining • Detecting patterns in sequences of sets • Nseq: Total # of sequences in the Database • Lseq: Avg # of itemsets in a sequence • Iseq : Avg # of items in an itemset • Lseq * Iseq : Avg length of a sequence
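To make the notation above concrete, here is a minimal sketch computing Nseq, Lseq, and Iseq for a toy database of sequences of itemsets (the data is illustrative, not from the talk):

```python
# Toy sequence database: each sequence is a list of itemsets (frozensets).
db = [
    [frozenset("AB"), frozenset("C")],
    [frozenset("A"), frozenset("BC"), frozenset("D")],
    [frozenset("B"), frozenset("CD")],
]

n_itemsets = sum(len(s) for s in db)

n_seq = len(db)                                          # Nseq: total # of sequences
l_seq = n_itemsets / n_seq                               # Lseq: avg # of itemsets per sequence
i_seq = sum(len(x) for s in db for x in s) / n_itemsets  # Iseq: avg # of items per itemset

print(n_seq, round(l_seq, 2), round(i_seq, 2))
```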

  26. Conventional Methods : Support Model • Super-sequence: (A,B,D)(B)(C,D)(B,C)  Sub-sequence: (A)(B)(C,D) • Support(P ): # of super-sequences of P in D • Given D and a user threshold min_sup, • find the complete set of patterns P s.t. Support(P ) ≥ min_sup • Methods • Breadth first – Apriori Principle (GSP) • R. Agrawal and R. Srikant : ICDE 95 & EDBT 96 • Depth first – pattern growth (PrefixSpan) • J. Han and J. Pei : SIGKDD 2000 & ICDE 2001
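The support model above can be sketched as follows; the containment check and the toy database are illustrative assumptions, not the talk's code:

```python
def is_subsequence(pattern, seq):
    """True if every itemset of `pattern` is contained, in order,
    in some itemset of `seq` (super-sequence containment)."""
    i = 0
    for itemset in seq:
        if i < len(pattern) and pattern[i] <= itemset:  # subset test
            i += 1
    return i == len(pattern)

def support(pattern, db):
    """# of sequences in db that are super-sequences of pattern."""
    return sum(is_subsequence(pattern, s) for s in db)

db = [
    [frozenset("ABD"), frozenset("B"), frozenset("CD"), frozenset("BC")],
    [frozenset("A"), frozenset("B"), frozenset("CD")],
]
p = [frozenset("A"), frozenset("B"), frozenset("CD")]
print(support(p, db))  # prints 2: both sequences contain p
```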

  27. Example: Support Model • {Dp, Br} {Mk, Dp} {Mk, Dp, Br} : 2/3 = 67% • 2^L − 1 = 2^7 − 1 = 128 − 1 = 127 subsequences • {Br} {Mk, Dp} {Mk, Dp, Br} • {Dp} {Mk, Dp} {Mk, Dp, Br} • {Dp, Br} {Dp} {Mk, Dp, Br} • {Dp, Br} {Mk} {Mk, Dp, Br} • {Dp, Br} {Mk, Dp} {Dp, Br} • {Dp, Br} {Mk, Dp} {Mk, Br} • {Dp, Br} {Mk, Dp} {Mk, Dp} • {Mk, Dp} {Mk, Dp, Br} • {Dp, Br} {Mk, Dp, Br} • … etc …
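The 2^L − 1 count can be checked with a toy calculation; this assumes the L = 7 item occurrences are treated as distinct, so every non-empty subset of occurrences yields a subsequence (an upper bound when items repeat):

```python
from itertools import combinations

# The example sequence {Dp,Br}{Mk,Dp}{Mk,Dp,Br} has L = 7 items in total.
# Counting every non-empty subset of the 7 item occurrences:
L = 7
n_subseqs = sum(len(list(combinations(range(L), k))) for k in range(1, L + 1))
print(n_subseqs)  # 127 = 2**7 - 1
```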

  28. Inherent Problems : the model • Support • cannot distinguish between statistically significant patterns and random occurrences • Theoretically • Short random sequences occur often in long sequential data simply by chance • Empirically • # of spurious patterns grows exponentially w.r.t. Lseq

  29. Inherent Problems : exact match • A pattern gets support only if • the pattern is exactly contained in the sequence • Often may not find general long patterns • Example • many customers may share similar buying habits • few of them follow exactly the same pattern

  30. Inherent Problems : Complete set • Mines the complete set • Too many trivial patterns • Given long sequences with noise • too expensive and too many patterns • 2^L − 1 = 2^10 − 1 = 1023 • Finding max / closed sequential patterns • is non-trivial • In a noisy environment, still too many max/closed patterns

  31. Possible Models • Support model • Patterns in sets • unordered list • Multiple alignment model • Find common patterns among strings • Simple ordered list of characters

  32. Multiple Alignment • line up the sequences to detect the trend • Find common patterns among strings • DNA / bio sequences

  35. Edit Distance • Pairwise Score (edit distance) : dist(seq1, seq2) • Minimum cost of ops required to change seq1 into seq2 • Ops = INDEL(X) and/or REPLACE(X,Y) • Recurrence relation • D(i, j) = min{ D(i−1, j) + INDEL, D(i, j−1) + INDEL, D(i−1, j−1) + REPLACE(Xi, Yj) } • Multiple Alignment Score • ∑ PS(seqi, seqj) over all pairs 1 ≤ i < j ≤ N • Optimal alignment : minimum score
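A minimal sketch of this edit distance for sequences of itemsets, using the normalized set difference REPLACE cost defined on slide 50 and INDEL = 1 (the toy sequences are illustrative assumptions):

```python
def repl(x, y):
    """REPLACE cost = normalized set difference:
    R(X, Y) = (|X - Y| + |Y - X|) / (|X| + |Y|)."""
    return (len(x - y) + len(y - x)) / (len(x) + len(y))

def dist(s1, s2):
    """Edit distance between two sequences of itemsets via the usual
    dynamic-programming recurrence, with INDEL(X) = R(X, {}) = 1."""
    n, m = len(s1), len(s2)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = float(i)              # i deletions, each cost 1
    for j in range(1, m + 1):
        d[0][j] = float(j)              # j insertions, each cost 1
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(d[i-1][j] + 1,                        # delete s1[i-1]
                          d[i][j-1] + 1,                        # insert s2[j-1]
                          d[i-1][j-1] + repl(s1[i-1], s2[j-1])) # replace
    return d[n][m]

s1 = [frozenset("AB"), frozenset("C")]
s2 = [frozenset("A"), frozenset("C")]
print(dist(s1, s2))  # only REPLACE({A,B}, {A}) = 1/3 is needed
```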

  37. Consensus Sequence • Weighted Sequence : • compression of aligned sequences into one sequence

  41. Consensus Sequence • Weighted Sequence : • compression of aligned sequences into one sequence • strength(i, j) = (# of occurrences of item i in position j) / (total # of sequences) • A : 3/3 = 100% • E : 1/3 = 33% • H : 1/3 = 33%

  42. Consensus Sequence • Weighted Sequence : • compression of aligned sequences into one sequence • strength(i, j) = (# of occurrences of item i in position j) / (total # of sequences) • Consensus itemset (j) : min_strength=2 • { ia | ia ∈ I ∪ {()} ∧ strength(ia, j) ≥ min_strength }

  43. Consensus Sequence • Weighted Sequence : • compression of aligned sequences into one sequence • strength(i, j) = (# of occurrences of item i in position j) / (total # of sequences) • Consensus itemset (j) : min_strength=2 • { ia | ia ∈ I ∪ {()} ∧ strength(ia, j) ≥ min_strength } • Consensus sequence : • concatenation of the consensus itemsets
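The weighted-sequence and consensus-sequence construction can be sketched as below; the aligned toy rows, the gap encoding (None), and the fractional min_strength value are illustrative assumptions:

```python
from collections import Counter

# Aligned sequences: one row per sequence, None marks an alignment gap.
aligned = [
    [frozenset("A"),  frozenset("BC"), frozenset("D")],
    [frozenset("AE"), frozenset("BC"), None],
    [frozenset("A"),  frozenset("B"),  frozenset("D")],
]
n = len(aligned)

def weighted_sequence(aligned):
    """Compress the aligned sequences: per position, count how many
    sequences contain each item."""
    return [Counter(item for row in aligned if row[j] for item in row[j])
            for j in range(len(aligned[0]))]

def consensus(aligned, min_strength):
    """Keep items whose strength (count / n) meets min_strength, then
    concatenate the non-empty consensus itemsets."""
    ws = seqs = weighted_sequence(aligned)
    seqs = [frozenset(i for i, c in pos.items() if c / n >= min_strength)
            for pos in ws]
    return [s for s in seqs if s]

print(consensus(aligned, min_strength=2/3))
```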

  45. Multiple Alignment Sequential Pattern Mining • Given • N sequences of sets, • Op costs (INDEL & REPLACE) for itemsets, and • Strength thresholds for consensus sequences • To (1) partition the N sequences into K sets of sequences such that the sum of the K multiple alignment scores is minimized, (2) find the multiple alignment for each partition, and (3) find the pattern consensus sequence and the variation consensus sequence for each partition

  46. Overview • What is KDD (Knowledge Discovery & Data mining) • Problem : Sequential Pattern Mining • Method : ApproxMAP • Evaluation Method • Results • Case Study • Conclusion

  47. ApproxMAP (Approximate Multiple Alignment Pattern mining) • Exact solution : Too expensive! • Approximation Method : ApproxMAP • Organize into K partitions • Use clustering • Compress each partition into • weighted sequences • Summarize each partition into • Pattern consensus sequence • Variation consensus sequence

  48. Tasks • Op costs (INDEL & REPLACE) for itemsets • Organize into K partitions • Use clustering • Compress each partition into • weighted sequences • Summarize each partition into • Pattern consensus sequence • Variation consensus sequence

  50. Op costs for itemsets • Normalized set difference • R(X,Y) = (|X−Y| + |Y−X|) / (|X| + |Y|) • 0 ≤ R ≤ 1, and R is a metric • INDEL(X) = R(X, ∅) = 1
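A minimal sketch of the normalized set difference; the guard for two empty sets is an added assumption, not from the slide:

```python
def R(x, y):
    """Normalized set difference: R(X, Y) = (|X-Y| + |Y-X|) / (|X| + |Y|)."""
    return (len(x - y) + len(y - x)) / (len(x) + len(y)) if (x or y) else 0.0

X, Y = frozenset("AB"), frozenset("BC")
print(R(X, X))              # identical itemsets: 0.0
print(R(X, frozenset()))    # INDEL(X) = R(X, {}) = 1.0
print(R(X, Y))              # (1 + 1) / (2 + 2) = 0.5
```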
