
7. Sequence Mining

Sequences and Strings

Recognition with Strings

MM & HMM

Sequence Association Rules

Data Mining – Sequences

H. Liu (ASU) & G Dong (WSU)


Sequences and Strings

  • A sequence x is an ordered list of discrete items, such as a sequence of letters or a gene sequence

    • Sequences and strings are often used as synonyms

    • String elements (characters, letters, or symbols) are nominal

    • Text is one particularly long type of string

  • |x| denotes the length of sequence x

    • |AGCTTC| is 6

  • Any contiguous string that is part of x is called a substring, segment, or factor of x

    • GCT is a factor of AGCTTC



Recognition with Strings

  • String matching

    • Given x and text, determine whether x is a factor of text

  • Edit distance (for inexact string matching)

    • Given two strings x and y, compute the minimum number of basic operations (character insertions, deletions and exchanges) needed to transform x into y



String Matching

  • Given |text| >> |x|, with characters taken from an alphabet A

    • A can be {0, 1}, {0, 1, 2,…, 9}, {A,G,C,T}, or {A, B,…}

  • A shift s is an offset needed to align the first character of x with character number s+1 in text

  • Determine whether there exists a valid shift s at which every character of x matches the corresponding character of text



Naïve (Brute-Force) String Matching

  • Given A, x, text, n = |text|, m = |x|

    s = 0
    while s ≤ n-m
        if x[1 … m] = text[s+1 … s+m]
            then print “pattern occurs at shift” s
        s = s + 1

  • Time complexity (worst case): O((n-m+1)m)

  • Shifting by one character at a time is not necessary
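As a runnable sketch, the brute-force matcher above translates directly into Python (the function name and example strings are illustrative, not from the slides):

```python
def naive_string_match(x, text):
    """Brute-force matcher from the slide: try every shift s from 0 to
    n - m and compare x against the window of text at that shift.
    Returns the list of all valid shifts."""
    n, m = len(text), len(x)
    shifts = []
    s = 0
    while s <= n - m:
        if text[s:s + m] == x:   # x[1 … m] = text[s+1 … s+m]
            shifts.append(s)
        s = s + 1                # one-character shift each time
    return shifts

print(naive_string_match("GCT", "AGCTTCGCT"))  # [1, 6]
```

The slicing comparison hides the inner character loop, but the work is the same: up to m comparisons for each of the n-m+1 shifts, giving the slide's O((n-m+1)m) worst case.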



Boyer-Moore and KMP

  • See StringMatching.ppt; do not use the following algorithm

  • Given A, x, text, n = |text|, m = |x|

    F(x) = last-occurrence function
    G(x) = good-suffix function
    s = 0
    while s ≤ n-m
        j = m
        while j > 0 and x[j] = text[s+j]
            j = j - 1
        if j = 0
            then print “pattern occurs at shift” s
                 s = s + G(0)
        else s = s + max[G(j), j - F(text[s+j])]
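To make the skipping idea concrete without the good-suffix machinery, here is a sketch of the Boyer-Moore-Horspool simplification, which keeps only the last-occurrence (bad-character) table and drops G entirely; the function name and example are illustrative:

```python
def horspool_match(x, text):
    """Boyer-Moore-Horspool matching: like Boyer-Moore, but shifts are
    driven only by a last-occurrence (bad-character) table, with no
    good-suffix function."""
    m, n = len(x), len(text)
    # For each character of x[1..m-1], its distance from the pattern's
    # end; characters absent from x fall back to a full shift of m.
    shift = {c: m - 1 - i for i, c in enumerate(x[:-1])}
    shifts = []
    s = 0
    while s <= n - m:
        if text[s:s + m] == x:
            shifts.append(s)
        # Slide by the table entry for the window's last character
        s += shift.get(text[s + m - 1], m)
    return shifts

print(horspool_match("GCT", "AGCTTCGCT"))  # [1, 6]
```

Because mismatching windows can be skipped by up to m characters at once, the expected number of comparisons is sublinear in n for large alphabets, which is exactly why one-character shifts are unnecessary.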



Edit Distance

  • ED between x and y describes how many fundamental operations are required to transform x to y.

  • Fundamental operations (x=‘excused’, y=‘exhausted’)

    • Substitutions e.g. ‘c’ is replaced by ‘h’

    • Insertions e.g. ‘a’ is inserted into x after ‘h’

    • Deletions e.g. a character in x is deleted

  • ED is one way of measuring similarity between two strings
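The standard way to compute ED is dynamic programming over prefixes of the two strings; a minimal sketch (function name is mine):

```python
def edit_distance(x, y):
    """Edit distance by dynamic programming: d[i][j] is the minimum
    number of substitutions, insertions, and deletions needed to
    transform x[:i] into y[:j]."""
    m, n = len(x), len(y)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                      # delete all i characters
    for j in range(n + 1):
        d[0][j] = j                      # insert all j characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[m][n]

print(edit_distance("excused", "exhausted"))  # 3: c→h, insert a, insert t
```

The slide's example comes out to 3: substitute ‘c’ with ‘h’, insert ‘a’, and insert ‘t’.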



Classification using ED

  • The nearest-neighbor algorithm can be applied for pattern recognition.

    • Training: store the strings together with their class labels

    • Classification (testing): the test string is compared to each stored string and an ED is computed; the label of the nearest stored string is assigned to the test string.

  • The key is how to calculate ED.

  • An example of calculating ED
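A minimal sketch of such a 1-nearest-neighbor classifier over strings (the labeled strings are hypothetical, just to exercise the idea):

```python
def edit_distance(x, y):
    """Two-row version of the edit-distance dynamic program."""
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, 1):
        cur = [i]
        for j, cy in enumerate(y, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (cx != cy)))  # substitution
        prev = cur
    return prev[-1]

def nn_classify(test, training):
    """Label the test string with the class of the training string at
    minimum edit distance (1-nearest neighbor)."""
    return min(training, key=lambda pair: edit_distance(pair[0], test))[1]

# Hypothetical labeled strings, just to exercise the classifier
labeled = [("AGCT", "gene-A"), ("TTTT", "gene-B")]
print(nn_classify("AGTT", labeled))  # gene-A (distance 1 vs. 2)
```

Note that the whole classifier reduces to "how to calculate ED", as the slide says; everything else is a minimum over stored strings.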



Hidden Markov Model

  • Markov Model: transitional states

  • Hidden Markov Model: additional visible states

  • Evaluation

  • Decoding

  • Learning



Markov Model

  • The Markov property:

    • given the current state, the transition probability is independent of any previous states.

  • A simple Markov Model

    • State ω(t) at time t

    • Sequence of length T:

      • ωT = {ω(1), ω(2), …, ω(T)}

    • Transition probability

      • P(ωj(t+1)| ωi(t)) = aij

    • It’s not required that aij =aji



Hidden Markov Model

  • Visible states

    • VT = {v(1), v(2), …, v(T)}

  • Emitting a visible state vk(t)

    • P(vk(t) | ωj(t)) = bjk

  • Only the visible states vk(t) are accessible; the hidden states ωi(t) are unobservable.

  • A Markov model is ergodic if every state has a nonzero probability of occurring given any starting state.



Three Key Issues with HMM

  • Evaluation

    • Given an HMM, complete with transition probabilities aij and emission probabilities bjk, determine the probability that a particular sequence of visible states VT was generated by that model

  • Decoding

    • Given an HMM and a set of observations VT. Determine the most likely sequence of hidden states ωT that led to VT.

  • Learning

    • Given the number of hidden and visible states and a set of training observations of visible symbols, determine the probabilities aij and bjk.
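The evaluation problem is solved by the forward algorithm: maintain, for each hidden state, the probability of the observations so far ending in that state, and push this vector through the trellis one observation at a time. A sketch with a toy two-state model (all numbers are illustrative, not from the slides):

```python
def forward(obs, a, b, initial):
    """HMM evaluation via the forward algorithm. alpha[j] holds
    P(observations so far, hidden state j now); the answer is the sum
    of the final alphas. a: transition probs aij, b: emission probs
    bjk, obs: indices of the observed visible symbols."""
    n = len(a)
    alpha = [initial[j] * b[j][obs[0]] for j in range(n)]
    for o in obs[1:]:
        alpha = [sum(alpha[i] * a[i][j] for i in range(n)) * b[j][o]
                 for j in range(n)]
    return sum(alpha)  # P(V^T | model)

# Toy two-state model with two visible symbols
a = [[0.7, 0.3], [0.4, 0.6]]     # aij
b = [[0.9, 0.1], [0.2, 0.8]]     # bjk
initial = [0.5, 0.5]
print(forward([0, 1], a, b, initial))
```

This costs O(c²T) for c hidden states and T observations, instead of the O(cᵀ) of summing over all hidden paths explicitly; decoding replaces the inner sum with a max (the Viterbi algorithm), and learning re-estimates aij and bjk (Baum-Welch).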



Other Sequential Pattern Mining Problems

  • Sequence alignment (homology) and sequence assembly (genome sequencing)

  • Trend analysis

    • Trend movement vs. cyclic variations, seasonal variations and random fluctuations

  • Sequential pattern mining

    • Various kinds of sequences (weblogs)

    • Various methods: From GSP to PrefixSpan

  • Periodicity analysis

    • Full periodicity, partial periodicity, cyclic association rules



Periodic Pattern

  • Full periodic pattern

    • ABCABCABC

  • Partial periodic pattern

    • ABC ADC ACC ABC

  • Pattern hierarchy

    • ABC ABC ABC DE DE DE DE ABC ABC ABC DE DE DE DE ABC ABC ABC DE DE DE DE

    • Over sequences of transactions, the hierarchy above can be written compactly as [ABC:3|DE:4]
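A partial periodic pattern fixes some positions within each period and leaves the rest as don't-cares; in the slide's example, ABC ADC ACC ABC matches A*C with period 3. A minimal sketch of counting such matches (the '*'-wildcard convention and function name are mine):

```python
def partial_periodic_support(seq, pattern, period):
    """Count period-length segments of seq matching a partial periodic
    pattern, where '*' is a don't-care matching any one symbol."""
    segments = [seq[i:i + period]
                for i in range(0, len(seq) - period + 1, period)]
    hits = sum(all(p == '*' or p == c for p, c in zip(pattern, seg))
               for seg in segments)
    return hits, len(segments)

# 'A*C' with period 3: every segment of ABC ADC ACC ABC matches
print(partial_periodic_support("ABCADCACCABC", "A*C", 3))  # (4, 4)
print(partial_periodic_support("ABCADCACCABC", "ABC", 3))  # (2, 4)
```

A full periodic pattern is the special case with no don't-cares, as the second call shows.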



Sequence Association Rule Mining

  • SPADE (Sequential Pattern Discovery using Equivalence classes)

  • Constrained sequence mining (SPIRIT)



Bibliography

  • R.O. Duda, P.E. Hart, and D.G. Stork, 2001. Pattern Classification. 2nd Edition. Wiley Interscience.



[Figure: state-transition diagram of a three-state Markov model; arrows between states 1, 2, and 3 carry the transition probabilities a11, a12, a13, a21, a22, a23, a31, a32, a33.]


[Figure: a three-state hidden Markov model: hidden states 1, 2, 3 connected by transition probabilities aij, each emitting the visible states v1–v4 with emission probabilities bjk.]


[Figure: trellis for HMM evaluation: hidden states 1 … c unrolled over time t = 1 … T; the value at state 2 and time 2 is computed from all time-1 states via the transitions a12, a22, a32, …, ac2 and the emission b2k of the observed symbol vk.]


[Figure: worked evaluation example for the observed sequence v3 v1 v3 v2 v0, tabulating the forward probabilities of states 0–3 at t = 0, …, 4 (entries such as 0.2 × 0.3 = 0.09, 0.0052, 0.0077, …); the sequence probability comes out to 0.0011.]


[Figure: left-to-right HMM for a spoken word, with states 0–7 emitting the phonemes /v/, /i/, /t/, /e/, /r/, /b/, /i/ and silence /-/.]


[Figure: trellis for HMM decoding: at each time t = 1 … T the most probable hidden state ωmax(t) among states 0 … c is identified, tracing out the path ωmax(1), ωmax(2), …, ωmax(T).]


[Figure: worked decoding example for the observed sequence v3 v1 v3 v2 v0, using the table of probabilities for states 0–3 at t = 0, …, 4 to read off the most likely hidden-state path.]

