EECS 800 Research Seminar Mining Biological Data

EECS 800 Research Seminar Mining Biological Data • Instructor: Luke Huan • Fall 2006

Administrative • Paper presentation schedule: • Han, Bin, Kernel Methods in Analyzing Biological Data, Nov 6th • Barker, Brett, Data Mining in Systems Biology, Nov 8th • Leung, Daniel, High Performance in Data Mining, Nov 13th • Ku, Matthew, Data Mining in Proteomics, Nov 15th • Lin, Cindy, Integrating Biological Data, Nov 20th • Jia, Yi, Analyzing Bionetworks, Nov 22nd

Sequential Pattern Mining • Why sequential pattern mining? • GSP algorithm • FreeSpan and PrefixSpan • Border Collapsing • Constraints and extensions

Sequence Databases and Sequential Pattern Analysis • (Temporal) order is important in many situations • Time-series databases and sequence databases • Frequent patterns → (frequent) sequential patterns • Applications of sequential pattern mining • Customer shopping sequences: • First buy a computer, then a CD-ROM, and then a digital camera, within 3 months. • Medical treatment, natural disasters (e.g., earthquakes), science & engineering processes, stocks and markets, telephone calling patterns, Weblog click streams, DNA sequences and gene structures

What Is Sequential Pattern Mining? • Given a set of sequences, find the complete set of frequent subsequences • A sequence: <(ef)(ab)(df)cb> • A sequence database: a set of sequences • An element may contain a set of items. Items within an element are unordered and we list them alphabetically. • <a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)> • Given support threshold min_sup = 2, <(ab)c> is a sequential pattern
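The subsequence relation on this slide can be checked mechanically. The following is a minimal sketch, assuming sequences are written in the slide notation without spaces; the helper names `parse` and `is_subsequence` are illustrative and not from the lecture.

```python
def parse(text):
    """Parse slide notation such as 'a(abc)(ac)d(cf)' into a list of item sets."""
    elements, i = [], 0
    while i < len(text):
        if text[i] == "(":
            j = text.index(")", i)          # end of a multi-item element
            elements.append(frozenset(text[i + 1:j]))
            i = j + 1
        else:                               # a bare letter is a one-item element
            elements.append(frozenset(text[i]))
            i += 1
    return elements


def is_subsequence(sub, seq):
    """True if every element of `sub` is a subset of a distinct element of `seq`, in order."""
    pos = 0
    for element in seq:
        # Greedy leftmost matching is sufficient for this containment test.
        if pos < len(sub) and sub[pos] <= element:
            pos += 1
    return pos == len(sub)


print(is_subsequence(parse("a(bc)dc"), parse("a(abc)(ac)d(cf)")))
print(is_subsequence(parse("(ab)c"), parse("(ef)(ab)(df)cb")))
```

Each element of the candidate must be contained in its own element of the longer sequence, so `<(bc)>` matches `(abc)` but `<bc>` would need two separate elements.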

Challenges on Sequential Pattern Mining • A huge number of possible sequential patterns are hidden in databases • A mining algorithm should • Find the complete set of patterns satisfying the minimum support (frequency) threshold • Be highly efficient, scalable, involving only a small number of database scans • Be able to incorporate various kinds of user-specific constraints

A Basic Property of Sequential Patterns: Apriori • A basic property: Apriori (Agrawal & Srikant '94) • If a sequence S is not frequent, then none of the super-sequences of S is frequent • E.g., <hb> is infrequent → so are <hab> and <(ah)b> • Given support threshold min_sup = 2 and the sequence database:
Seq. ID  Sequence
10       <(bd)cb(ac)>
20       <(bf)(ce)b(fg)>
30       <(ah)(bf)abf>
40       <(be)(ce)d>
50       <a(bd)bcb(ade)>
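To make the Apriori example concrete, the sketch below counts the support of <hb> and two of its super-sequences in the five-sequence database shown on this slide. The function names are hypothetical helpers, and the regular-expression parser assumes lowercase single-letter items.

```python
import re

DB = {
    10: "(bd)cb(ac)",
    20: "(bf)(ce)b(fg)",
    30: "(ah)(bf)abf",
    40: "(be)(ce)d",
    50: "a(bd)bcb(ade)",
}


def parse(text):
    # "(bd)cb(ac)" -> [{'b','d'}, {'c'}, {'b'}, {'a','c'}]
    return [set(group or single)
            for group, single in re.findall(r"\(([a-z]+)\)|([a-z])", text)]


def contains(sequence, pattern):
    """Greedy check that `pattern` is a subsequence of `sequence`."""
    pos = 0
    for element in sequence:
        if pos < len(pattern) and pattern[pos] <= element:
            pos += 1
    return pos == len(pattern)


def support(pattern, db):
    """Number of database sequences that contain `pattern`."""
    return sum(contains(parse(seq), parse(pattern)) for seq in db.values())


for pattern in ["hb", "hab", "(ah)b"]:
    print(pattern, support(pattern, DB))
```

All three patterns occur only in sequence 30, so each has support 1 and falls below min_sup = 2, as the Apriori property predicts for the super-sequences once <hb> is known to be infrequent.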

Basic Algorithm : Breadth First Search (GSP) • L=1 • While (ResultL != NULL) • Candidate Generate • Prune • Test • L=L+1
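The generate, prune, test loop can be sketched as follows. This is a simplified illustration, assuming every element holds exactly one item (so a sequence is just a string of letters); full GSP also handles multi-item elements, which this sketch omits. The function names are illustrative.

```python
def is_subsequence(pattern, sequence):
    remaining = iter(sequence)
    # `item in remaining` consumes the iterator, which enforces the order.
    return all(item in remaining for item in pattern)


def gsp(db, min_sup):
    """Level-wise mining of frequent subsequences over single-item sequences."""
    def support(pattern):
        return sum(is_subsequence(pattern, seq) for seq in db)

    items = sorted({item for seq in db for item in seq})
    level = [(item,) for item in items if support((item,)) >= min_sup]
    result = {}
    while level:  # stop when a level yields no frequent pattern
        frequent = set(level)
        result.update((p, support(p)) for p in level)
        # Candidate generation: join p and q when p without its first item
        # equals q without its last item.
        candidates = {p + q[-1:] for p in frequent for q in frequent
                      if p[1:] == q[:-1]}
        # Prune: every subsequence obtained by deleting one item must be frequent.
        candidates = {c for c in candidates
                      if all(c[:i] + c[i + 1:] in frequent for i in range(len(c)))}
        # Test: one pass over the database to count the surviving candidates.
        level = sorted(c for c in candidates if support(c) >= min_sup)
    return result


patterns = gsp(["abc", "abc", "acb"], min_sup=2)
for pattern, sup in sorted(patterns.items()):
    print("".join(pattern), sup)
```

The toy database here is made up for illustration; each iteration of the `while` loop corresponds to one value of L on the slide.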

Finding Length-1 Sequential Patterns • Initial candidates: all singleton sequences • <a>, <b>, <c>, <d>, <e>, <f>, <g>, <h> • Scan database once, count support for candidates • min_sup = 2
Seq. ID  Sequence
10       <(bd)cb(ac)>
20       <(bf)(ce)b(fg)>
30       <(ah)(bf)abf>
40       <(be)(ce)d>
50       <a(bd)bcb(ade)>
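The first database scan can be sketched in a few lines. This is an illustrative snippet, assuming single-letter items; an item counts once per sequence no matter how often it occurs in it.

```python
import re
from collections import Counter

DB = {
    10: "(bd)cb(ac)",
    20: "(bf)(ce)b(fg)",
    30: "(ah)(bf)abf",
    40: "(be)(ce)d",
    50: "a(bd)bcb(ade)",
}
MIN_SUP = 2

# One scan: for each sequence, collect its distinct items and count them once.
counts = Counter(item
                 for seq in DB.values()
                 for item in set(re.findall(r"[a-z]", seq)))

frequent = sorted(item for item, sup in counts.items() if sup >= MIN_SUP)
print(dict(sorted(counts.items())))
print(frequent)
```

With min_sup = 2, the items g and h (support 1 each) are dropped, leaving six length-1 sequential patterns.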

The Mining Process (min_sup = 2)
Sequence database: 10 <(bd)cb(ac)>, 20 <(bf)(ce)b(fg)>, 30 <(ah)(bf)abf>, 40 <(be)(ce)d>, 50 <a(bd)bcb(ade)>
1st scan: 8 cand., 6 length-1 seq. pat. • <a> <b> <c> <d> <e> <f> <g> <h>
2nd scan: 51 cand., 19 length-2 seq. pat., 10 cand. not in DB at all • <aa> <ab> … <af> <ba> <bb> … <ff> <(ab)> … <(ef)>
3rd scan: 46 cand., 19 length-3 seq. pat., 20 cand. not in DB at all • <abb> <aab> <aba> <baa> <bab> …
4th scan: 8 cand., 6 length-4 seq. pat. • <abba> <(bd)bc> …
5th scan: 1 cand., 1 length-5 seq. pat. • <(bd)cba>
Legend: some candidates cannot pass the support threshold; others do not occur in the DB at all.

Generating Length-2 Candidates • 51 length-2 candidates from the 6 frequent items: 6*6 + 6*5/2 = 51 • Without the Apriori property: 8*8 + 8*7/2 = 92 candidates • Apriori prunes 44.57% of the candidates
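The candidate counts above can be reproduced by enumerating the two kinds of length-2 candidates: ordered pairs <xy> (including <xx>) and unordered item sets <(xy)>. The function name below is a hypothetical helper.

```python
from itertools import combinations, product


def length2_candidates(items):
    """All length-2 candidates that can be built from `items`."""
    sequences = [f"<{x}{y}>" for x, y in product(items, repeat=2)]   # n * n
    item_sets = [f"<({x}{y})>" for x, y in combinations(items, 2)]   # n * (n - 1) / 2
    return sequences + item_sets


with_apriori = length2_candidates("abcdef")       # only the 6 frequent items
without_apriori = length2_candidates("abcdefgh")  # all 8 items

pruned = 1 - len(with_apriori) / len(without_apriori)
print(len(with_apriori), len(without_apriori), f"{pruned:.2%}")
```

Dropping the two infrequent items g and h removes 41 of the 92 possible candidates before the second scan.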

The SPADE Algorithm • SPADE (Sequential PAttern Discovery using Equivalence classes), developed by Zaki, 2001 • A vertical-format sequential pattern mining method • A sequence database is mapped to a large set of entries of the form • Item: <SID, EID> (sequence ID, event ID) • Sequential pattern mining is performed by • growing the subsequences (patterns) one item at a time by Apriori candidate generation
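The vertical format can be illustrated with a small sketch. It converts the five-sequence database used earlier in these slides into per-item lists of (SID, EID) pairs and joins two lists to obtain the support of a length-2 sequence. This is only an illustration of the data layout, not Zaki's full algorithm; the function names and the use of 1-based EIDs are assumptions.

```python
from collections import defaultdict

# Each sequence is a list of elements; each element is a string of items.
DB = {
    10: ["bd", "c", "b", "ac"],
    20: ["bf", "ce", "b", "fg"],
    30: ["ah", "bf", "a", "b", "f"],
    40: ["be", "ce", "d"],
    50: ["a", "bd", "b", "c", "b", "ade"],
}


def to_vertical(db):
    """Map each item to its id-list of (SID, EID) pairs."""
    idlists = defaultdict(list)
    for sid, sequence in db.items():
        for eid, element in enumerate(sequence, start=1):
            for item in element:
                idlists[item].append((sid, eid))
    return idlists


def temporal_join(first, second):
    """Id-list of <x y>: occurrences of y that follow some x in the same sequence."""
    earliest = {}
    for sid, eid in first:
        earliest[sid] = min(eid, earliest.get(sid, eid))
    return [(sid, eid) for sid, eid in second
            if sid in earliest and earliest[sid] < eid]


def support(idlist):
    return len({sid for sid, _ in idlist})


vertical = to_vertical(DB)
print(vertical["h"])
print(support(temporal_join(vertical["a"], vertical["b"])))  # support of <ab>
print(support(temporal_join(vertical["b"], vertical["a"])))  # support of <ba>
```

Once the id-lists are built, supports of longer patterns come from joining lists rather than rescanning the original sequences.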

The SPADE Algorithm

Bottlenecks of GSP and SPADE • A large set of candidates could be generated • 1,000 frequent length-1 sequences generate a huge number of length-2 candidates! • Multiple scans of database in mining • Breadth-first search • Mining long sequential patterns

Pattern Growth (prefixSpan) • Prefix and Suffix (Projection) • <a>, <aa>, <a(ab)> and <a(abc)> are prefixes of sequence <a(abc)(ac)d(cf)> • Given sequence <a(abc)(ac)d(cf)>

Example (min_sup = 2):

PrefixSpan (the example to be continued) • Step 1: Find length-1 sequential patterns (pattern:support): <a>:4, <b>:4, <c>:4, <d>:3, <e>:3, <f>:3 • Step 2: Divide the search space into six subsets according to the six prefixes • Step 3: Find subsets of sequential patterns by constructing the corresponding projected databases and mining each recursively

Example • Find sequential patterns having prefix <a>: • Scan sequence database S once. Sequences in S containing <a> are projected w.r.t. <a> to form the <a>-projected database. • Scan the <a>-projected database once to get six length-2 sequential patterns having prefix <a>: • <a>:2, <b>:4, <(_b)>:2, <c>:4, <d>:2, <f>:2 • <aa>:2, <ab>:4, <(ab)>:2, <ac>:4, <ad>:2, <af>:2 • Recursively, all sequential patterns having prefix <a> can be further partitioned into 6 subsets. Construct the respective projected databases and mine each. • E.g., the <aa>-projected database has two sequences: • <(_bc)(ac)d(cf)> and <(_e)>.

Example to be continued

PrefixSpan Algorithm • Main idea: use frequent prefixes to divide the search space and to project sequence databases; only search the relevant sequences. • PrefixSpan(α, i, S|α) • Initially α is a single frequent element in S • Scan S|α once, find the set of frequent items b such that • b can be assembled to the last element of α to form a sequential pattern; or • <b> can be appended to α to form a sequential pattern. • For each frequent item b, append it to α to form a sequential pattern α', and output α'; • For each α', construct the α'-projected database S|α', and call PrefixSpan(α', i+1, S|α').
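The recursion can be sketched as below. This is a simplified version, assuming every element holds a single item, so only the "append" extension applies and the "assemble into the last element" case, written (_b) in the slides, is left out. The toy database and function names are made up for illustration.

```python
def prefixspan(db, min_sup):
    """Return (pattern, support) pairs for sequences of single items."""
    results = []

    def mine(prefix, projected):
        # One scan of the projected database: count each item once per suffix.
        counts = {}
        for suffix in projected:
            for item in set(suffix):
                counts[item] = counts.get(item, 0) + 1
        for item in sorted(counts):
            if counts[item] < min_sup:
                continue
            pattern = prefix + [item]
            results.append((pattern, counts[item]))
            # Project: keep what follows the first occurrence of `item`.
            next_projected = [s[s.index(item) + 1:] for s in projected if item in s]
            mine(pattern, next_projected)

    mine([], [list(seq) for seq in db])
    return results


found = {"".join(p): sup for p, sup in prefixspan(["abc", "abc", "acb"], min_sup=2)}
print(found)
```

No candidates are generated here: each recursive call only looks at the suffixes that can still extend the current prefix, which is the point of the projection.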

CloSpan: Mining Closed Sequential Patterns • A closed sequential pattern s: there exists no superpattern s' such that s' ⊃ s, and s' and s have the same support • Motivation: reduces the number of (redundant) patterns but attains the same expressive power • Uses backward subpattern and backward superpattern pruning to prune redundant search space • CloSpan: Mining closed sequential patterns in large datasets, Yan et al., SDM'03

CloSpan: Performance Comparison with PrefixSpan

Noise-tolerant Sequence Patterns • There is noise in real-world sequence data • Biological sequences • Gene expression profiles • Web-log collection • A compatibility matrix is introduced to tolerate a certain level of noise • Yang et al., Mining Long Sequential Patterns in a Noisy Environment, SIGMOD'02

Approximate Match • When you observe d1 • Spread count as • d1: 90%, d2: 5%, d3: 5% Compatibility Matrix

Match • The degree to which pattern P is retained/reflected in sequence S • M(P, S) = P(P|S) • M(P, S) = C(p1, s1) * C(p2, s2) * … * C(plP, slP) when lS = lP • M(P, S) = max of M(P, S') over all possible length-lP subsequences S' of S when lS > lP • Example

Calculate Max over all • Dynamic programming • M(p1p2..pi, s1s2…sj) = max of • M(p1p2..pi-1, s1s2…sj-1) * C(pi, sj) • M(p1p2..pi, s1s2…sj-1) • O(lP * lS) • When the compatibility matrix is sparse: O(lS)
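The recurrence on this slide translates directly into a table-filling routine. The sketch below assumes the compatibility matrix is given as a nested dictionary `C[true_symbol][observed_symbol]`. The column for an observed d1 uses the 90% / 5% / 5% spread from the Approximate Match slide; the remaining entries are made-up illustrative values. The base cases (empty pattern matches with value 1, a pattern longer than the sequence matches with value 0) are also assumptions.

```python
def match(pattern, sequence, C):
    """Best match of `pattern` in `sequence` under compatibility matrix C."""
    lp, ls = len(pattern), len(sequence)
    # M[i][j] = match of pattern[:i] within sequence[:j]
    M = [[0.0] * (ls + 1) for _ in range(lp + 1)]
    for j in range(ls + 1):
        M[0][j] = 1.0                      # empty pattern
    for i in range(1, lp + 1):
        for j in range(i, ls + 1):         # entries with j < i stay 0
            use = M[i - 1][j - 1] * C[pattern[i - 1]][sequence[j - 1]]
            skip = M[i][j - 1]             # leave s_j unused
            M[i][j] = max(use, skip)
    return M[lp][ls]


# C[true][observed]; each observed column sums to 1.
C = {
    "d1": {"d1": 0.90, "d2": 0.10, "d3": 0.05},
    "d2": {"d1": 0.05, "d2": 0.80, "d3": 0.05},
    "d3": {"d1": 0.05, "d2": 0.10, "d3": 0.90},
}

print(match(["d1", "d2"], ["d1", "d2"], C))
print(match(["d1", "d2"], ["d1", "d3", "d2"], C))
```

The table has (lP + 1) * (lS + 1) cells and each is filled in constant time, which gives the O(lP * lS) cost stated on the slide.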

Match in a Sequence

Match in D • Average over all sequences in D

Anti-Monotone • If compatibility matrix is identity matrix, match = support • Theorem: the match of a pattern P in a symbol sequence S is less than or equal to the match of any subpattern of P in S • Corollary: the match of a pattern P in a sequence database D is less than or equal to the match of any subpattern of P in D • Can use any support based algorithm • More patterns match so require efficient solution • Sample based algorithms • Border collapsing of ambiguous patterns

Chernoff Bound • Given sample size n, sample mean μ, and the range R of the data, the population mean is within μ ± ε • ε = sqrt([R² ln(1/δ)] / (2n)) • with probability 1 − δ (almost certain) • Can the estimation be replaced by a normal approximation due to the law of large numbers? • Distribution free • More conservative • Sample size: fit in memory • Restricted spread: • For pattern P = p1p2..pL • R = min(match[pi]) for all 1 ≤ i ≤ L
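The bound is easy to evaluate numerically. The sketch below computes ε for a given range, sample size, and confidence parameter δ, and shows how the restricted spread tightens it. The function names and the per-symbol match values are hypothetical.

```python
import math


def chernoff_epsilon(spread, n, delta):
    """Half-width of the interval around the sample mean, valid with probability 1 - delta."""
    return math.sqrt(spread ** 2 * math.log(1 / delta) / (2 * n))


def restricted_spread(pattern, symbol_match):
    """R = min(match[p_i]): a pattern cannot match more than its rarest symbol."""
    return min(symbol_match[symbol] for symbol in pattern)


symbol_match = {"a": 0.3, "b": 0.1, "c": 0.5}   # made-up per-symbol matches
n, delta = 10_000, 0.0001

eps_full = chernoff_epsilon(1.0, n, delta)
eps_restricted = chernoff_epsilon(restricted_spread("abc", symbol_match), n, delta)
print(round(eps_full, 5), round(eps_restricted, 5))
```

Because ε scales linearly with R, replacing the trivial range 1 by the restricted spread shrinks the interval by the same factor without needing a larger sample.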

Algorithm • Scan DB: O(N * Ls * m) • Find the match of each individual symbol • Take a random sample of sequences • N: # of sequences, Ls: average sequence length, m: # of symbols • Identify borders that embrace the set of ambiguous patterns: O(m^Lp * |S| * Lp * n) • min_match ± ε • existing methods for association rule mining • Lp: length of the longest pattern, |S|: average length of a sample sequence, n: # of samples • Locate the border of frequent patterns • via border collapsing
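One way to read the min_match ± ε step is as a three-way labelling of patterns based on their match in the sample. The sketch below is an interpretation under that assumption, with a hypothetical function name and made-up numbers; only the ambiguous band needs further database scans.

```python
def classify(sample_match, min_match, epsilon):
    """Label a pattern from its match in the sample (assumed interpretation)."""
    if sample_match > min_match + epsilon:
        return "frequent"      # above the band: accepted without further scans
    if sample_match < min_match - epsilon:
        return "infrequent"    # below the band: rejected without further scans
    return "ambiguous"         # inside the band: resolve against the full DB


for value in (0.30, 0.22, 0.10):
    print(value, classify(value, min_match=0.20, epsilon=0.05))
```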

Border Collapsing • If memory cannot hold the counters for all ambiguous patterns • Probe-and-collapse: binary search • Probe patterns with the highest collapsing power until memory is filled • If memory can hold all patterns up to the 1/x layer • the space of ambiguous patterns can be narrowed to at most 1/x of the original one • where x is a power of 2 • If a level-wise search takes y scans of the DB, only O(log_x y) scans are necessary when the border collapsing technique is employed

Border Collapsing

Episodes and Episode Pattern Mining • Other methods for specifying the kinds of patterns • Serial episodes: A → B • Parallel episodes: A & B • Regular expressions: (A | B)C • Methods for episode pattern mining • First find all frequent serial and parallel episodes • Combine frequent serial and parallel episodes to derive general episodes or regular expressions • Discovery of Frequent Episodes in Event Sequences, Mannila et al., Data Mining and Knowledge Discovery, 1:259-289, 1997

Periodicity Analysis • Periodicity is everywhere: tides, seasons, daily power consumption, etc. • Full periodicity • Every point in time contributes (precisely or approximately) to the periodicity • Partial periodicity: a more general notion • Only some segments contribute to the periodicity • Jim reads the NY Times 7:00-7:30 am every weekday • Cyclic association rules • Associations which form cycles • Methods • Full periodicity: FFT, other statistical analysis methods • Partial and cyclic periodicity: variations of Apriori-like mining methods

Periodic Pattern • Full periodic pattern • ABCABCABC • Partial periodic pattern • ABC ADC ACC ABC • Pattern hierarchy • ABC ABC ABC DE DE DE DE ABC ABC ABC DE DE DE DE ABC ABC ABC DE DE DE DE

Periodic Pattern • Recent Achievements • Partial Periodic Pattern • Asynchronous Periodic Pattern • Meta Pattern • InfoMiner/InfoMiner+/STAMP

Constraint-Based Seq. Pattern Mining • Constraint-based sequential pattern mining • Constraints: user-specified, for focused mining of desired patterns • How to explore efficient mining with constraints? Optimization • Classification of constraints • Anti-monotone: e.g., value_sum(S) < 150, min(S) > 10 • Monotone: e.g., count(S) > 5, S ⊇ {PC, digital_camera} • Succinct: e.g., length(S) ≥ 10, S ∈ {Pentium, MS/Office, MS/Money} • Convertible: e.g., value_avg(S) < 25, profit_sum(S) > 160, max(S)/avg(S) < 2, median(S) – min(S) > 5 • Inconvertible: e.g., avg(S) – median(S) = 0

From Sequential Patterns to Structured Patterns • Sets, sequences, trees, graphs, and other structures • Transaction DB: Sets of items • {{i1, i2, …, im}, …} • Sets of Sequences: • {{<i1, i2>, …, <im,in, ik>}, …} • Sets of trees: {t1, t2, …, tn} • Sets of graphs (mining for frequent subgraphs): • {g1, g2, …, gn} • Mining structured patterns in XML documents, bio-molecule structures, etc.

References: Sequential Pattern Mining Methods • R. Agrawal and R. Srikant. Mining sequential patterns. ICDE'95, 3-14, Taipei, Taiwan. • R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. EDBT’96. • J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal, M.-C. Hsu, "FreeSpan: Frequent Pattern-Projected Sequential Pattern Mining", Proc. 2000 Int. Conf. on Knowledge Discovery and Data Mining (KDD'00), Boston, MA, August 2000. • H. Mannila, H Toivonen, and A. I. Verkamo. Discovery of frequent episodes in event sequences. Data Mining and Knowledge Discovery, 1:259-289, 1997. • J. Pei, J. Han, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu, "PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth", Proc. 2001 Int. Conf. on Data Engineering (ICDE'01), Heidelberg, Germany, April 2001.

References: Sequential Pattern Mining Methods • B. Ozden, S. Ramaswamy, and A. Silberschatz. Cyclic association rules. ICDE'98, 412-421, Orlando, FL. • S. Ramaswamy, S. Mahajan, and A. Silberschatz. On the discovery of interesting patterns in association rules. VLDB'98, 368-379, New York, NY. • M.J. Zaki. Efficient enumeration of frequent sequences. CIKM’98. November 1998. • M.N. Garofalakis, R. Rastogi, K. Shim: SPIRIT: Sequential Pattern Mining with Regular Expression Constraints. VLDB 1999: 223-234, Edinburgh, Scotland. • Wei Wang, Jiong Yang, Philip S. Yu: Mining Patterns in Long Sequential Data with Noise. SIGKDD Explorations 2(2): 28-33 (2000) • Jiong Yang, Wei Wang, Philip S. Yu, Jiawei Han: Mining Long Sequential Patterns in a Noisy Environment. SIGMOD Conference 2002

References: Periodic Pattern Mining Methods • Jiawei Han, Wan Gong, Yiwen Yin: Mining Segment-Wise Periodic Patterns in Time-Related Databases. KDD 1998: 214-218 • Jiawei Han, Guozhu Dong, Yiwen Yin: Efficient Mining of Partial Periodic Patterns in Time Series Database. ICDE 1999: 106-115 • Jiong Yang, Wei Wang, Philip S. Yu: Mining asynchronous periodic patterns in time series data. KDD 2000: 275-279 • Wei Wang, Jiong Yang, Philip S. Yu: Meta-patterns: Revealing Hidden Periodic Patterns. ICDM 2001: 550-557 • Jiong Yang, Wei Wang, Philip S. Yu: Infominer: mining surprising periodic patterns. KDD 2001: 395-400
