Master Course. MSc Bioinformatics for Health Sciences H15: Algorithms on strings and sequences Xavier Messeguer Peypoch (http://www.lsi.upc.es/~alggen) Dep. de Llenguatges i Sistemes Informàtics CEPBA-IBM Research Institute Universitat Politècnica de Catalunya. Contents.

Master Course

MSc Bioinformatics for

Health Sciences

H15: Algorithms on strings and sequences

Xavier Messeguer Peypoch (http://www.lsi.upc.es/~alggen)

Dep. de Llenguatges i Sistemes Informàtics

CEPBA-IBM Research Institute

Universitat Politècnica de Catalunya


Contents

1. (Exact) String matching of one pattern

2. (Exact) String matching of many patterns

3. Extended string matching and regular expressions

4. Approximate string matching (Dynamic programming)

5. Pairwise and multiple alignment

6. Suffix trees


Master Course

Second lecture:

First part:

Extended string matching


Extended string matching

There are classes of characters represented by one symbol. For instance, the IUPAC code for the DNA alphabet is:

R = {G,A}  Y = {T,C}  K = {G,T}  M = {A,C}  S = {G,C}  W = {A,T}

B = {G,T,C}  D = {G,A,T}  H = {A,C,T}  V = {G,C,A}  N = {A,G,C,T} (any)

1. Classes of characters in the text.

Some characters in the text represent sets of symbols.

2. Classes of characters in the pattern.

Some characters in the pattern represent sets of symbols.
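
As a minimal sketch of what matching with classes means, the IUPAC table above can be stored as sets, and two symbols can be declared compatible when their sets share at least one base. The names `IUPAC`, `compatible` and `matches_at` are hypothetical helpers introduced only for this illustration; they are not part of the course material.

```python
# IUPAC classes as listed on the slide; plain bases stand for themselves.
IUPAC = {
    "A": {"A"}, "C": {"C"}, "G": {"G"}, "T": {"T"},
    "R": {"G", "A"}, "Y": {"T", "C"}, "K": {"G", "T"},
    "M": {"A", "C"}, "S": {"G", "C"}, "W": {"A", "T"},
    "B": {"G", "T", "C"}, "D": {"G", "A", "T"},
    "H": {"A", "C", "T"}, "V": {"G", "C", "A"},
    "N": {"A", "G", "C", "T"},
}


def compatible(x, y):
    """True if the classes of x and y share at least one base."""
    return bool(IUPAC[x] & IUPAC[y])


def matches_at(text, pattern, pos):
    """True if the pattern can occur at text[pos:], symbol by symbol.

    Classes may appear in the text, in the pattern, or in both.
    """
    if pos < 0 or pos + len(pattern) > len(text):
        return False
    return all(compatible(t, p) for t, p in zip(text[pos:], pattern))


if __name__ == "__main__":
    print(compatible("R", "A"))                      # R = {G, A} contains A
    print(matches_at("GTARTRNAAGGA", "ATGTA", 3))    # classes in the text
    print(matches_at("ATGTA", "ANGTR", 0))           # classes in the pattern
```

This check is quadratic if applied at every position; the slides that follow look at how the faster exact-matching algorithms cope with classes.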


Classes in the text

Most efficient algorithms (Navarro & Raffinot)

[Figure: map of the regions where Horspool, BOM and BNDM are the most efficient. Vertical axis: alphabet size |Σ| (2 to 64). Horizontal axis: pattern length (2 to 256), with w marked.]


Classes in the text: Horspool example

Given the pattern ATGTA, the shift table is:

A 4
C 5
G 2
T 1
R ?
N ?


Classes in the text: Horspool example

Suppose the pattern is ATGTA. The shift table would be:

A 4
C 5
G 2
T 1
R 2
N ?


Classes in the text: Horspool example

Given the pattern ATGTA and the shift table:

A 4
C 5
G 2
T 1
R 2
N 1

and given the text, the window moves as follows:

G T A R T R N A A G G A …
A T G T A
  A T G T A
      A T G T A


Classes in the text: Horspool example

Given the pattern ATGTA and the shift table:

A 4
C 5
G 2
T 1
R 2
N 1

and given the text, the window moves as follows:

G T A R T R N A A G G A ...
A T G T A
  A T G T A
      A T G T A
              A T G T A
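
The shift table and the window moves of this example can be reproduced with a short sketch. It assumes that the pattern contains only A, C, G and T, that the text may contain any IUPAC symbol, and that the shift of a class is the minimum of the shifts of its members, which is what the values R 2 and N 1 above suggest. The function names are hypothetical.

```python
# IUPAC classes written as strings of bases.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "GA", "Y": "TC", "K": "GT", "M": "AC", "S": "GC", "W": "AT",
    "B": "GTC", "D": "GAT", "H": "ACT", "V": "GCA", "N": "AGCT",
}


def shift_table(pattern):
    """Horspool shifts for the four bases, extended to every IUPAC class."""
    m = len(pattern)
    base = {c: m for c in "ACGT"}
    # The last pattern position is excluded, as in plain Horspool.
    for i, c in enumerate(pattern[:-1]):
        base[c] = m - 1 - i
    # A class may stand for any of its bases, so only the smallest shift is safe.
    return {code: min(base[b] for b in bases) for code, bases in IUPAC.items()}


def horspool_classes(text, pattern):
    """Start positions where the pattern is compatible with the text."""
    m, n = len(pattern), len(text)
    if m == 0:
        raise ValueError("the pattern must not be empty")
    shift = shift_table(pattern)
    hits, pos = [], 0
    while pos <= n - m:
        window = text[pos:pos + m]
        if all(p in IUPAC[t] for t, p in zip(window, pattern)):
            hits.append(pos)
        # Shift according to the text symbol under the end of the window.
        pos += shift[text[pos + m - 1]]
    return hits


if __name__ == "__main__":
    table = shift_table("ATGTA")
    print({c: table[c] for c in "ACGTRN"})
    print(horspool_classes("GTARTRNAAGGA", "ATGTA"))
```

On the text of the slide the window visits positions 0, 1, 3 and 7 (counting from 0) and reports one occurrence, at position 3, where R T R N A is compatible with A T G T A.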


Classes in the text

Most efficient algorithms (Navarro & Raffinot)

BNDM: Backward Nondeterministic Dawg Matching

BOM: Backward Oracle Matching

[Figure: map of the regions where Horspool, BOM and BNDM are the most efficient. Vertical axis: alphabet size |Σ| (2 to 64). Horizontal axis: pattern length (2 to 256), with w marked.]


Exact search algorithms for one pattern (on-line text)

Most efficient algorithms (Navarro & Raffinot)

BNDM: Backward Nondeterministic Dawg Matching

BOM: Backward Oracle Matching

[Figure: map of the regions where Horspool, BOM and BNDM are the most efficient. Vertical axis: alphabet size |Σ| (2 to 64). Horizontal axis: pattern length (2 to 256), with w marked.]


Classes in the text: BOM

How does it do the comparison?

Text:

Pattern:

Automaton: Factor Oracle

How is the next position of the window determined?

It checks whether the suffix is a factor of the pattern.

But first let us analyse how the comparison is done…


Classes in the text: BOM example

How does it do the comparison?

The automaton of the reversed pattern is built. Suppose the pattern is ATGTATG:

A T G T A T G

[Figure: factor oracle of the reversed pattern, with transitions labelled G, T, A, G, T, T, A, G, T, A.]

And the search over the text:

G T A R T R N A A T G…

No improvement is possible!


Exact search algorithms for many patterns

[Figure: four panels, for sets of 5, 10, 100 and 1000 words, showing the regions where Wu-Manber, SBOM and Ad AC are the most efficient. Vertical axis: alphabet size |Σ| (2 to 8). Horizontal axis: minimum pattern length (5 to 45).]


Classes in the text: Set Horspool

Search for the patterns ATGTATG, TATG, ATAAT, ATGTG

in the text: ARTGNCTATGTGACA…

[Figure: automaton built from the set of patterns, with labels G, T, A, T, A, T, G, G, T, A, T, A, A, T, A, A.]

No improvement is possible!


Classes in the text

[Figure: four panels, for sets of 5, 10, 100 and 1000 words, showing the regions where Wu-Manber, SBOM and Ad AC are the most efficient. Vertical axis: alphabet size |Σ| (2 to 8). Horizontal axis: minimum pattern length (5 to 45).]


Classes in the pattern

Most efficient algorithms (Navarro & Raffinot)

[Figure: map of the regions where Horspool, BOM and BNDM are the most efficient. Vertical axis: alphabet size |Σ| (2 to 64). Horizontal axis: pattern length (2 to 256), with w marked.]


Master Course

Second lecture:

Second part:

Regular expressions matching


Regular expressions

A regular expression ℛ is a string over Σ ∪ { ε, |, ·, *, (, ) } defined recursively as:

ε is a regular expression

A character of Σ is a regular expression

( ℛ ) is a regular expression

ℛ1 · ℛ2 is a regular expression

ℛ1 | ℛ2 is a regular expression

ℛ* is a regular expression


Regular language

The language represented by a regular expression is the set of words that can be built from the regular expression.

The problem of searching for a regular expression in a text is that of finding all the factors of the text that belong to the corresponding regular language.
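
The search problem stated above can be illustrated with a brute-force sketch that tests every factor of the text against the expression. It uses Python's `re` syntax, in which concatenation is written by juxtaposition (no `·`) and ε is the empty expression. The function name `regular_factors` is hypothetical, and the quadratic enumeration is only meant to make the definition concrete, not to be efficient.

```python
import re


def regular_factors(expression, text):
    """All (start, end) pairs such that text[start:end] belongs to the language."""
    compiled = re.compile(expression)
    return [
        (i, j)
        for i in range(len(text) + 1)
        for j in range(i, len(text) + 1)
        # fullmatch with pos/endpos tests exactly the factor text[i:j].
        if compiled.fullmatch(text, i, j)
    ]


if __name__ == "__main__":
    # A followed by any number of C or T, followed by G.
    print(regular_factors("A(C|T)*G", "TACTGA"))
    # Overlapping factors are reported too.
    print(regular_factors("AT|TA", "ATA"))
```

Note that `re.finditer` alone would report only non-overlapping leftmost matches, which is why every factor is tested explicitly here.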


Master Course

Second lecture:

Third part:

Approximate string matching


Approximate string matching

For instance, given the sequence

CTACTACTACGTCTATACTGATCGTAGCTACTACATGC

search for the pattern ACTGA allowing one error…

… but what is the meaning of “one error”?


Edit distance

We accept three types of errors:

1. Mismatch: ACCGTGAT → ACCGAGAT

2. Insertion: ACCGTGAT → ACCGATGAT

3. Deletion: ACCGTGAT → ACCGGAT

Insertions and deletions are jointly called indels.

The edit distance d between two strings is the minimum number of substitutions, insertions and deletions needed to transform the first string into the second one.

d(ACT,ACT)=   d(ACT,AC)=   d(ACT,C)=

d(ACT,)=   d(AC,ATC)=   d(ACTTG,ATCTG)=


Edit distance

We accept three types of errors:

1. Mismatch: ACCGTGAT → ACCGAGAT

2. Insertion: ACCGTGAT → ACCGATGAT

3. Deletion: ACCGTGAT → ACCGGAT

Insertions and deletions are jointly called indels.

The edit distance d between two strings is the minimum number of substitutions, insertions and deletions needed to transform the first string into the second one.

d(ACT,ACT)=0   d(ACT,AC)=1   d(ACT,C)=2

d(ACT,)=3   d(AC,ATC)=1   d(ACTTG,ATCTG)=2
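
The definition can be checked literally on these small examples: the edit distance is the length of the shortest chain of single-character operations leading from the first string to the second. The sketch below does a breadth-first search over strings; the name `edit_distance_bfs` is hypothetical, and the approach takes exponential time, so it is only suitable for very short strings like the ones on the slide.

```python
from collections import deque


def edit_distance_bfs(a, b):
    """Minimum number of substitutions, insertions and deletions from a to b."""
    alphabet = sorted(set(a) | set(b))
    # An optimal script never needs a string longer than the longer input
    # (do the deletions first, then substitutions, then insertions).
    max_len = max(len(a), len(b))
    seen = {a}
    queue = deque([(a, 0)])
    while queue:
        s, dist = queue.popleft()
        if s == b:
            return dist
        neighbours = []
        for i in range(len(s) + 1):
            if i < len(s):
                neighbours.append(s[:i] + s[i + 1:])                      # deletion
                neighbours += [s[:i] + c + s[i + 1:] for c in alphabet]   # substitution
            neighbours += [s[:i] + c + s[i:] for c in alphabet]           # insertion
        for t in neighbours:
            if len(t) <= max_len and t not in seen:
                seen.add(t)
                queue.append((t, dist + 1))
    raise AssertionError("unreachable: b can always be reached from a")


if __name__ == "__main__":
    for x, y in [("ACT", "ACT"), ("ACT", "AC"), ("ACT", "C"),
                 ("ACT", ""), ("AC", "ATC"), ("ACTTG", "ATCTG")]:
        print(f"d({x},{y}) = {edit_distance_bfs(x, y)}")
```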


Edit distance and alignment of strings

The edit distance is related to the best alignment of strings.

Given

d(ACT,ACT)=0   d(ACT,AC)=1   d(ACTTG,ATCTG)=2

which is the best alignment in every case?

  • ACT and ACT:

    ACT
    ACT

  • ACT and AC:

    ACT
    AC-

  • ACTTG and ATCTG:

    ACTTG
    ATCTG

    ACT-TG
    A-TCTG


Edit distance and alignment of strings

But what is the distance between the strings

ACGCTATGCTATACG and ACGGTAGTGACGC?

… and what is the best alignment between them?

The problem was first discussed in 1966…

and the algorithm was proposed in 1968, 1970, …

using the technique called “dynamic programming”.


Edit distance and alignment of strings

    C T A C T A C T A C G T
A
C
T
G
A

Each cell contains the distance between a prefix of the row string and a prefix of the column string. For instance, the cell in row C and in the column of the second T contains the distance between AC and CTACT.


Edit distance and alignment of strings

    C T A C T A C T A C G T
  ?
A
C
T
G
A


Edit distance and alignment of strings

    C T A C T A C T A C G T
  0 ?
A
C
T
G
A


Edit distance and alignment of strings

    C T A C T A C T A C G T
  0 1 ?
A
C
T
G
A

-
C


Edit distance and alignment of strings

    C T A C T A C T A C G T
  0 1 2 ?
A
C
T
G
A

--
CT


Edit distance and alignment of strings

    C T A C T A C T A C G T
  0 1 2 3 4 5 6 7 8 …
A
C
T
G
A

------
CTACTA


Edit distance and alignment of strings

    C T A C T A C T A C G T
  0 1 2 3 4 5 6 7 8 …
A ?
C ?
T ?
G
A


Edit distance and alignment of strings

    C T A C T A C T A C G T
  0 1 2 3 4 5 6 7 8 …
A 1
C 2
T 3
G …
A

ACT
---


Edit distance and alignment of strings

    C T A C T A C T A C G T
  0 1 2 3 4 5 6 7 8 …
A 1
C 2
T 3
G
A

BA(AC,CTAC) = best of:

    BA(AC,CTA) followed by the column  -
                                       C

    BA(A,CTA) followed by the column   C
                                       C

    BA(A,CTAC) followed by the column  C
                                       -

d(AC,CTAC) = min { d(AC,CTA)+1, d(A,CTA), d(A,CTAC)+1 }
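
The recurrence above can be turned into a table-filling sketch. It assumes unit costs (match 0, mismatch 1, indel 1), a first row and first column initialised to 0, 1, 2, … as on the previous slides, and a traceback that recovers one best alignment (ties are broken arbitrarily, so other optimal alignments may exist). The names `edit_table` and `best_alignment` are hypothetical.

```python
def edit_table(a, b):
    """d[i][j] = edit distance between a[:i] (rows) and b[:j] (columns)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for j in range(cols):
        d[0][j] = j            # empty prefix of a against b[:j]: j insertions
    for i in range(rows):
        d[i][0] = i            # a[:i] against the empty prefix: i deletions
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i][j - 1] + 1,         # gap in a
                          d[i - 1][j - 1] + cost,  # match or mismatch
                          d[i - 1][j] + 1)         # gap in b
    return d


def best_alignment(a, b):
    """One optimal alignment, recovered by walking back through the table."""
    d = edit_table(a, b)
    i, j = len(a), len(b)
    top, bottom = [], []
    while i > 0 or j > 0:
        cost = 0 if i > 0 and j > 0 and a[i - 1] == b[j - 1] else 1
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + cost:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    table = edit_table("AC", "CTAC")
    print(table[2][3] + 1, table[1][3], table[1][4] + 1, "->", table[2][4])
    for line in best_alignment("ACTTG", "ATCTG"):
        print(line)
```

The first printed line shows the three candidates of the recurrence for d(AC,CTAC) and the minimum that is stored in the cell.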


Edit distance and alignment of strings

Connect to

http://alggen.lsi.upc.es/docencia/ember/leed/Tfc1.htm

and use the global method.


Edit distance and alignment of strings

How can this algorithm be applied

to approximate search?

to K-approximate string searching?


K-approximate string searching

    C T A C T A C T A C G T A C T G G T G A A …
A
C
T
G
A

This cell …


K-approximate string searching

    C T A C T A C T A C G T A C T G G T G A A …
A
C
T
G
A

This cell gives the distance between (ACTGA, CT…GTA)…

…but we are only interested in the last characters of the text.
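
One standard way to make the table ignore the beginning of the text is to set the whole first row to 0, so that an occurrence may start anywhere, and then to report every column whose bottom cell is at most k. The sketch below follows that idea, keeping a single column in memory; it is an assumption that this is the variant the course develops next, and the name `approx_search` is hypothetical.

```python
def approx_search(pattern, text, k):
    """End positions j (exclusive) such that some factor of the text ending
    at j is within edit distance k of the pattern."""
    m = len(pattern)
    # Column 0 of the table: pattern[:i] against the empty text costs i.
    col = list(range(m + 1))
    ends = []
    for j, tc in enumerate(text, start=1):
        prev_diag = col[0]     # value of the top cell in the previous column
        col[0] = 0             # first row is all zeros: free start in the text
        for i in range(1, m + 1):
            old = col[i]       # cell (i, j-1), needed as the next diagonal
            cost = 0 if pattern[i - 1] == tc else 1
            col[i] = min(prev_diag + cost,  # match or mismatch
                         old + 1,           # text character inserted
                         col[i - 1] + 1)    # pattern character deleted
            prev_diag = old
        if col[m] <= k:
            ends.append(j)
    return ends


if __name__ == "__main__":
    text = "CTACTACTACGTCTATACTGATCGTAGCTACTACATGC"
    print(approx_search("ACTGA", text, 0))   # exact occurrences only
    print(approx_search("ACTGA", text, 1))   # allowing one error
```

With k = 0 the only end position reported for the example sequence is 21, the end of the exact occurrence of ACTGA; with k = 1 further end positions appear.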


Master Course

Second lecture:

Fourth part:

Pairwise and multiple alignment


Bioinformatics

Pairwise and multiple alignment


Pairwise alignment

Edit distance (to be minimised):

match=0 mismatch=1 indel=1

d(AC,CTAC) = minimum { d(A,CTAC)+1, d(A,CTA)+0, d(AC,CTA)+1 }

Similarity (to be maximised):

match=1 mismatch=-1 indel=-2

s(AC,CTAC) = maximum { s(A,CTAC)-2, s(A,CTA)+1, s(AC,CTA)-2 }
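
The similarity recurrence can be sketched in the same way as the distance one, replacing the minimum by a maximum and the costs by scores. The defaults below are the values given on the slide (match=1, mismatch=-1, indel=-2); the function name `similarity` and the row-by-row layout are choices made for this illustration only.

```python
def similarity(a, b, match=1, mismatch=-1, indel=-2):
    """Score of the best global alignment of a and b."""
    # Row 0: the empty prefix of a against b[:j] needs j indels.
    prev = [j * indel for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * indel]     # a[:i] against the empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag,                  # align a[i-1] with b[j-1]
                            prev[j] + indel,       # a[i-1] against a gap
                            curr[j - 1] + indel))  # b[j-1] against a gap
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(similarity("ACT", "ACT"))
    print(similarity("ACT", "AC"))
    # With match=0, mismatch=-1, indel=-1 the score is minus the edit distance.
    print(-similarity("ACTTG", "ATCTG", match=0, mismatch=-1, indel=-1))
```

The last call illustrates the link between the two formulations: maximising a score with those three values is the same as minimising the edit distance.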


Pairwise alignment

Connect to

http://alggen.lsi.upc.es

Links to TEACHING EMBER LePA


Pairwise to multiple alignment

[Figure: three-dimensional dynamic programming table with axes S1, S2 and S3; labels A, C, A, -1.]

What happens with three strings?

Let n be their length; then the cost becomes

O(n^3 · 2^3 · 3^2)

And with k strings?

O(n^k · 2^k · k^2)


Multiple alignment

Multiple alignment programs use different heuristics:

  • Clustal (Progressive alignment)

    http://www.ebi.ac.uk/clustalw

  • TCoffee (Progressive alignment + databases)

    http://igs-server.cnrs-mrs.fr/Tcoffee_cgi/index.cgi

  • HMM (Hidden Markov Models)


Multiple alignment

Connect to

http://alggen.lsi.upc.es/

and follow the links TEACHING EMBER.

