t cniques i eines bioinform tiques n.
Download
Skip this Video
Loading SlideShow in 5 Seconds..
Tècniques i Eines Bioinformàtiques PowerPoint Presentation
Download Presentation
Tècniques i Eines Bioinformàtiques

Loading in 2 Seconds...

play fullscreen
1 / 34

Tècniques i Eines Bioinformàtiques - PowerPoint PPT Presentation


  • 88 Views
  • Uploaded on

Tècniques i Eines Bioinformàtiques. Bioinformatics, Sequence and Genome Analysis David W. Mount Flexible Pattern Matching in Strings (2002) Gonzalo Navarro and Mathieu Raffinot Algorithms on strings (2001) M. Crochemore, C. Hancart and T. Lecroq

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about 'Tècniques i Eines Bioinformàtiques' - kristen-wong


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
t cniques i eines bioinform tiques
Tècniques i Eines Bioinformàtiques
  • Bioinformatics, Sequence and Genome Analysis
  • David W. Mount
  • Flexible Pattern Matching in Strings (2002)
  • Gonzalo Navarro and Mathieu Raffinot
  • Algorithms on strings (2001)
  • M. Crochemore, C. Hancart and T. Lecroq
  • http://www-igm.univ-mlv.fr/~lecroq/string/index.html
algorismes i estructures eficients de cerca
Algorismes i estructures eficients de cerca

String matching: definition of the problem (text,pattern)

depends on what we have: text or patterns

  • Exact matching:
  • The patterns ---> Data structures for the patterns
  • 1 pattern ---> The algorithm depends on |p| and ||
  • k patterns ---> The algorithm depends on k, |p| and ||
  • The text ----> Data structure for the text (suffix tree, ...)
  • Approximate matching:
  • Dynamic programming
  • Sequence alignment (pairwise and multiple)
approximate string matching
Approximate string matching

For instance, given the sequence

CTACTACTACGTGACTAATACTGATCGTAGCTAC…

search for the pattern ACTGA allowing one error…

… but what is the meaning of “one error”?

edit distance
Edit distance

Indel

We accept three types of errors:

1. Mismatch: ACCGTGAT ACCGAGAT

2. Insertion: ACCGTGAT ACCGATGAT

3. Deletion: ACCGTGAT ACCGGAT

The edit distance d between two strings is the

minimum number of

substitutions,insertions and deletions

needed to transform the first string into the second one

d(ACT,ACT)= d(ACT,AC)= d(ACT,C)=

d(ACT,)= d(AC,ATC)= d(ACTTG,ATCTG)=

edit distance1
Edit distance

Indel

We accept three types of errors:

1. Mismatch: ACCGTGAT ACCGAGAT

2. Insertion: ACCGTGAT ACCGATGAT

3. Deletion: ACCGTGAT ACCGGAT

The edit distance d between two strings is the

minimum number of

substitutions,insertions and deletions

needed to transform the first string into the second one

d(ACT,ACT)= d(ACT,AC)= d(ACT,C)=

d(ACT,)= d(AC,ATC)= d(ACTTG,ATCTG)=

0

1

2

3

1

2

edit distance and alignment of strings
Edit distance and alignment of strings

The Edit distance is related with the best alignment of strings

Given

d(ACT,ACT)=0 d(ACT,AC)=1 d(ACTTG,ATCTG)=2

which is the best alignment in every case?

  • ACT and ACT : ACT
  • ACT
  • ACT and AT: ACT
  • A -T
  • ACTTG and ATCTG:

ACTTG

ATCTG

ACT - TG

A - TCTG

Then, the alignment suggest the substitutions, insertions and deletions to transform one string into the other

edit distance and alignment of strings1
Edit distance and alignment of strings

But which is the distance between the strings

ACGCTATGCTATACG and ACGGTAGTGACGC?

… and the best alignment between them?

1966 was the first time this problem was discussed…

and the algorithm was proposed in 1968,1970,…

using the technique called “Dynamic programming”

edit distance and alignment of strings2
Edit distance and alignment of strings

C T A C T A C T A C G T

A

C

T

G

A

edit distance and alignment of strings3
Edit distance and alignment of strings

C T A C T A C T A C G T

A

C

T

G

A

edit distance and alignment of strings4
Edit distance and alignment of strings

C T A C T A C T A C G T

A

C

T

G

A

The cell contains the distance between AC and CTACT.

edit distance and alignment of strings5
Edit distance and alignment of strings

C T A C T A C T A C G T

A

C

T

G

A

?

edit distance and alignment of strings6
Edit distance and alignment of strings

C T A C T A C T A C G T

0

A

C

T

G

A

?

edit distance and alignment of strings7
Edit distance and alignment of strings

C T A C T A C T A C G T

0 1

A

C

T

G

A

?

-

C

edit distance and alignment of strings8
Edit distance and alignment of strings

C T A C T A C T A C G T

0 1 2

A

C

T

G

A

?

- -

CT

edit distance and alignment of strings9
Edit distance and alignment of strings

C T A C T A C T A C G T

0 1 2 3 4 5 6 7 8 …

A

C

T

G

A

- - - - - -

CTACTA

edit distance and alignment of strings10
Edit distance and alignment of strings

C T A C T A C T A C G T

0 1 2 3 4 5 6 7 8 …

A ?

C ?

T ?

G

A

edit distance and alignment of strings11
Edit distance and alignment of strings

C T A C T A C T A C G T

0 1 2 3 4 5 6 7 8 …

A 1

C 2

T 3

G…

A

ACT

- - -

edit distance and alignment of strings12
Edit distance and alignment of strings

-

C

C

C

C

-

BA(AC,CTA)

BA(A,CTA)

BA(A,CTAC)

C T A C T A C T A C G T

0 1 2 3 4 5 6 7 8 …

A 1

C 2

T 3

G

A

C T A C T A C T A C G T

A

C

T

G

A

d(AC,CTA)+1

d(A,CTA)

BA(AC,CTAC)= best

d(AC,CTAC)=min

d(A,CTAC)+1

edit distance and alignment of strings13
Edit distance and alignment of strings

Connect to

http://alggen.lsi.upc.es/docencia/ember/leed/Tfc1.htm

and use the global method.

edit distance and alignment of strings14
Edit distance and alignment of strings

How this algorithm can be applied

to the approximate search?

to the K-approximate string searching?

k approximate string searching
K-approximate string searching

C T A C T A C T A C G T A C T G G T G A A …

A

C

T

G

A

This cell …

k approximate string searching1
K-approximate string searching

C T A C T A C T A C G T A C T G G T G A A …

A

C

T

G

A

This cell gives the distance between (ACTGA, CT…GTA)…

…but we only are interested in the last characters

k approximate string searching2
K-approximate string searching

C T A C T A C T A C G T A C T G G T G A A …

A

C

T

G

A

This cell gives the distance between (ACTGA, CT…GTA)…

…but we only are interested in the last characters

k approximate string searching3
K-approximate string searching

* * * * * * C T A C G T A C T G G T G A A …

A

C

T

G

A

This cell gives the distance between (ACTGA, CT…GTA)…

…but we only are interested in the last characters…

…no matter where they appears in the text, then…

k approximate string searching4
K-approximate string searching

* * * * * * C T A C G T A C T G G T G A A …

0

A

C

T

G

A

This cell gives the distance between (ACTGA, CT…GTA)…

…but we only are interested in the last characters…

…no matter where they appears in the text, then…

k approximate string searching5
K-approximate string searching

* * * * * * C T A C G T A C T G G T G A A …

0

A

C

T

G

A

This cell gives the distance between (ACTGA, CT…GTA)…

…but we only are interested in the last characters…

…no matter where they appears in the text, then…

k approximate string searching6
K-approximate string searching

C T A C T A C T A C G T A C T G G T G A A …

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

A

C

T

G

A

This cell gives the distance between (ACTGA, CT…GTA)…

…but we only are interested in the last characters…

…no matter where they appears in the text, then

k approximate string searching7
K-approximate string searching

Connect to

http://alggen.lsi.upc.es/docencia/ember/leed/Tfc1.htm

and use the semi-global method.

bioinformatics
Bioinformatics

Pairwise and multiple alignment

pairwise alignment
Pairwise alignment

+

-

s(A,CTAC)-2

s(AC,CTACT)=maximum s(A,CTA) 1

s(AC,CTA)-2

Edit distance:

match=0 mismatch=1 indel=1

d(A,CTAC)+1

d(AC,CTACT)=minimum d(A,CTA)….+1

d(AC,CTA)+1

Similarity:

match=1 mismatch=-1 indel=-2

pairwise alignment1
Pairwise alignment

Connect to

http://alggen.lsi.upc.es

Links to TEACHING EMBER LePA

pairwise to multiple alignment
Pairwise to multiple alignment

S2

A

C

A

-1

S3

__

S1

What happens with three strings?

Let n be their lenght, then the cost becomes

O(n3)

“O(23)”

“O(32)”

And with k strings?

O(nk 2k k2)

multiple alignment
Multiple alignment

Programs of multialignment use different heuristics:

  • Clustal (Progressive alignment)

http://www.ebi.ac.uk/clustalw

  • TCoffee (Progressive alignment + data bases)

http://igs-server.cnrs-mrs.fr/Tcoffee_cgi/index.cgi

  • HMM (Hidden Markov Models)
multiple alignment1
Multiple alignment

Connect to

http://alggen.lsi.upc.es/

and follow the links TEACHING EMBER.