
Assessing the Quantitative Significance of Sequential Patterns

Yifan Cao, Dr. Byron Gao. 5/31/11-8/1/11. Project Statement: We seek to find a method to quantitatively describe the significance of general sequential patterns. What is “significant” or “interesting”?


Presentation Transcript


  1. Assessing the Quantitative Significance of Sequential Patterns Yifan Cao, Dr. Byron Gao 5/31/11-8/1/11

  2. Project Statement • We seek to find a method to quantitatively describe the significance of general sequential patterns

  3. What is “significant” or “interesting?” What makes a pattern an interesting one? Naive answer: depth and length

  4. P-Values • P-value(Pattern p) = Probability(p occurs by chance at least as often as it does in our data) • A smaller p-value means a more significant pattern

  5. Why would we care? • Almost all significance measures deal with non-sequential data • Those dealing with sequential data are highly data-specific • A significance measure separates patterns that matter from patterns that are merely products of the data set’s structure

  6. Sequential vs. Non-sequential Data • Examples of Non-sequential Data: • Groceries purchased • Facebook friends • Top 5 favorite exotic fruits and vegetables • Examples of Sequential Data: • Words • DNA Sequences • Number of hours you sleep per night • Unclear/Could be both • Products purchased on Amazon (student prime!) • Books read

  7. Structural Differences • Non-sequential Data: • Easily expressed as a matrix of supports • No problems with subsets having different sizes • Easy to construct similar data sets through randomization • Sequential Data: • Cannot be expressed as a 2-D matrix of supports • Subsets of different lengths are problematic for a matrix representation • Cannot carry out randomization on a matrix of items

  8. Solution? Think Simpler! • We’re looking for a method for general sequential patterns • Proposal: • Randomize the ordering of items in each sequence • Obtain a probability of a pattern occurring for each sequence • Use such probabilities to generate a distribution for the total number of pattern occurrences
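
The first two steps of the proposal can be approximated by simulation. The sketch below is not the authors' method, which is combinatorial; it only shuffles a single sequence many times and counts how often the pattern shows up. The function name `estimate_pattern_probability`, the assumption that a pattern "occurs" when its items appear contiguously, and the trial count are all illustrative choices, not taken from the slides.

```python
import random

def estimate_pattern_probability(sequence, pattern, trials=20000, seed=0):
    """Estimate P(a random ordering of `sequence` contains `pattern`).

    Assumption: the pattern counts as occurring only when its items
    appear next to each other, in order.
    """
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    items = list(sequence)
    hits = 0
    for _ in range(trials):
        rng.shuffle(items)  # randomize the ordering of the items
        if pattern in "".join(items):
            hits += 1
    return hits / trials

estimate = estimate_pattern_probability("ABCBA", "ABC")
```

For the sequence ABCBA and pattern ABC, the estimate should land close to the 1/5 that the combinatorial calculation on slide 10 produces.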

  9. Computing p-values • For each sequence in the data set, find the probability that the pattern will occur if the sequence's ordering is randomized • Given each sequence's probability of containing the pattern, construct the overall distribution of the number of times the pattern occurs in the data set
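
One way to carry out the second step, sketched under assumptions the slides do not state: each sequence is randomized independently and contributes at most one occurrence. The total count is then a sum of independent yes/no events with different probabilities, and its distribution can be built up one sequence at a time. The names `count_distribution` and `p_value` are hypothetical.

```python
def count_distribution(probabilities):
    """Return dist, where dist[k] = P(exactly k sequences contain the pattern)."""
    dist = [1.0]  # with no sequences, the count is 0 with certainty
    for p in probabilities:
        nxt = [0.0] * (len(dist) + 1)
        for k, mass in enumerate(dist):
            nxt[k] += mass * (1.0 - p)  # this sequence lacks the pattern
            nxt[k + 1] += mass * p      # this sequence contains the pattern
        dist = nxt
    return dist

def p_value(probabilities, observed_count):
    """P(count >= observed_count) under independent randomization."""
    dist = count_distribution(probabilities)
    return sum(dist[observed_count:])

# Three sequences, each with a 0.2 chance of containing the pattern,
# and the pattern observed in two of them.
example = p_value([0.2, 0.2, 0.2], 2)
```

The update loop is quadratic in the number of sequences, which is usually acceptable; a normal approximation would be a cheaper alternative for very large data sets.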

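  10. Use combinatorics to analyze and compute the probability that a random ordering of a given sequence will contain pattern P • N = # of unique orderings = $\binom{\text{sequence length}}{\text{dictionary values}}$, a multinomial coefficient whose dictionary values are the counts of each distinct item • For ABCDE: $N = \binom{5}{1,1,1,1,1} = 120$; for ABCBA: $N = \binom{5}{2,2,1} = 30$ • M = (sequence length - pattern length + 1) $\binom{\text{surplus length}}{\text{surplus dictionary values}}$, where the surplus is what remains of the sequence once the pattern's items are removed • For P=ABC and sequence ABCBA: $M = (3)\binom{2}{1,1} = 6$ • So the probability of ABCBA containing pattern P=ABC is M/N = 6/30 = 1/5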
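
The M/N calculation can be written out directly and checked against brute-force enumeration. This is a minimal sketch: the function names are hypothetical, and it assumes the pattern must appear as a contiguous run, which is what the (sequence length - pattern length + 1) factor suggests.

```python
from collections import Counter
from fractions import Fraction
from itertools import permutations
from math import factorial

def multinomial(counts):
    """Number of distinct orderings of a multiset with the given item counts."""
    total = factorial(sum(counts))
    for c in counts:
        total //= factorial(c)
    return total

def pattern_probability(sequence, pattern):
    """M / N from slide 10, as an exact fraction."""
    seq_counts = Counter(sequence)
    pat_counts = Counter(pattern)
    # The pattern cannot occur if the sequence lacks any of its items.
    if any(pat_counts[item] > seq_counts[item] for item in pat_counts):
        return Fraction(0)
    surplus = seq_counts - pat_counts  # items left after removing the pattern
    n = multinomial(seq_counts.values())
    m = (len(sequence) - len(pattern) + 1) * multinomial(surplus.values())
    return Fraction(m, n)

def brute_force_probability(sequence, pattern):
    """Enumerate every distinct ordering; only practical for short sequences."""
    orderings = {"".join(p) for p in permutations(sequence)}
    hits = sum(pattern in ordering for ordering in orderings)
    return Fraction(hits, len(orderings))

formula = pattern_probability("ABCBA", "ABC")      # Fraction(1, 5)
check = brute_force_probability("ABCBA", "ABC")    # Fraction(1, 5)
```

M counts an ordering once per position at which the pattern appears, so the ratio overstates the probability whenever the pattern can appear more than once in a single ordering. The ABCBA example avoids this because it contains only one C.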

  11. Advantages • All work is probabilistic, so finding p-values is a very fast operation • Longer patterns’ significance can be built off of shorter patterns’ significance • Allows large, comprehensive sets of patterns to be judged for significance • Could lead to a significance-based closed-frequent pattern-finding algorithm

  12. Related Works • Randomization of real-valued matrices for assessing the significance of data mining results by Markus Ojala • Ranking Sequential Patterns with Respect to Significance by Robert Gwadera • Frequent Pattern Mining with Uncertain Data by Charu Aggarwal

  13. Further Study • Dealing with patterns occurring multiple times within one sequence • Modifying significance calculation to allow for more flexibility while maintaining overall structure of data • Algorithmic applications, especially in closed-frequent types of pattern finding algorithms

  14. In Conclusion • Our method makes significance assessment accessible for general sequential patterns • The combinatorial approach means it runs very fast • The significance calculation is highly scalable for huge sets of patterns

  15. Thank you for listening!
