Master Course
MSc Bioinformatics for Health Sciences
H15: Algorithms on strings and sequences
Xavier Messeguer Peypoch (http://www.lsi.upc.es/~alggen)
Dep. de Llenguatges i Sistemes Informàtics
CEPBA-IBM Research Institute
Universitat Politècnica de Catalunya
Contents
1. (Exact) String matching of one pattern
2. (Exact) String matching of many patterns
3. Extended string matching and regular expressions
4. Approximate string matching (Dynamic programming)
5. Pairwise and multiple alignment
6. Suffix trees
Some symbols represent classes of characters.
For instance, the IUPAC code for the
DNA alphabet is:
R = {G,A} Y = {T,C} K = {G,T} M = {A,C} S = {G,C} W = {A,T}
B = {G,T,C} D = {G,A,T} H = {A,C,T} V = {G,C,A} N = {A,G,C,T} (any)
1. Classes of characters in the text.
There are characters in the text that
represent sets of symbols.
2. Classes of characters in the pattern.
There are characters in the pattern that
represent sets of symbols.
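The IUPAC classes listed above can be sketched as a lookup table. This is only an illustration: the names IUPAC, compatible and matches_at are hypothetical, and the rule "two symbols match when their sets share a base" is an assumption about how class symbols are compared.

```python
# IUPAC DNA codes as listed above; plain bases stand for themselves.
IUPAC = {
    "A": {"A"}, "C": {"C"}, "G": {"G"}, "T": {"T"},
    "R": {"G", "A"}, "Y": {"T", "C"}, "K": {"G", "T"},
    "M": {"A", "C"}, "S": {"G", "C"}, "W": {"A", "T"},
    "B": {"G", "T", "C"}, "D": {"G", "A", "T"},
    "H": {"A", "C", "T"}, "V": {"G", "C", "A"},
    "N": {"A", "G", "C", "T"},
}


def compatible(a, b):
    """True if symbols a and b can stand for a common base (assumed rule)."""
    return not IUPAC[a].isdisjoint(IUPAC[b])


def matches_at(pattern, text, pos):
    """True if pattern matches text at offset pos, symbol by symbol."""
    window = text[pos:pos + len(pattern)]
    return len(window) == len(pattern) and all(
        compatible(p, t) for p, t in zip(pattern, window))


if __name__ == "__main__":
    print(compatible("R", "A"), compatible("R", "Y"))
    print(matches_at("ATGTA", "GTARTRNAAGGA", 3))
```

Because both arguments are looked up in the same table, the same helper covers classes in the text and classes in the pattern.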
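Most efficient algorithms (Navarro & Raffinot)
[Figure: map of the regions where Horspool, BOM and BNDM are the fastest. Horizontal axis: pattern length, 2 to 256, with the word size w marked. Vertical axis: 2 to 64.]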
Given the pattern ATGTA,
the shift table is:
A 4
C 5
G 2
T 1
R ?
…
N ?
A class symbol takes the smallest shift among the characters it represents, so R = {G,A} gets min(2,4) = 2.
Given the pattern ATGTA
and the shift table:
A 4
C 5
G 2
T 1
R 2
…
N 1
Given the text:
G T A R T R N A A G G A ...
A T G T A
  A T G T A
      A T G T A
              A T G T A
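The shift table and the window positions above can be reproduced with a short sketch. It assumes a Horspool-style search in which the pattern contains only plain bases, the text may contain IUPAC class symbols, and a class symbol takes the minimum shift of its members. The function names are hypothetical.

```python
# IUPAC symbols mapped to the plain bases they represent.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "GA", "Y": "TC", "K": "GT", "M": "AC", "S": "GC", "W": "AT",
    "B": "GTC", "D": "GAT", "H": "ACT", "V": "GCA", "N": "AGCT",
}


def shift_table(pattern):
    """Horspool shifts for plain bases, extended to classes by the minimum."""
    m = len(pattern)
    base = {c: m for c in "ACGT"}
    for i, c in enumerate(pattern[:-1]):  # the last character is excluded
        base[c] = m - 1 - i
    return {sym: min(base[c] for c in members)
            for sym, members in IUPAC.items()}


def horspool_classes(pattern, text):
    """Return (match offsets, offsets of every window that was inspected)."""
    m = len(pattern)
    shift = shift_table(pattern)
    hits, windows = [], []
    pos = 0
    while pos + m <= len(text):
        windows.append(pos)
        window = text[pos:pos + m]
        # A pattern base matches a text symbol if it belongs to its class.
        if all(p in IUPAC[t] for p, t in zip(pattern, window)):
            hits.append(pos)
        pos += shift[text[pos + m - 1]]  # shift by the last window symbol
    return hits, windows


if __name__ == "__main__":
    print(shift_table("ATGTA"))
    print(horspool_classes("ATGTA", "GTARTRNAAGGA"))
```

On the text GTARTRNAAGGA the windows start at offsets 0, 1, 3 and 7, which agrees with the four alignments drawn above; under the class rule assumed here, the window at offset 3 (RTRNA) is compatible with ATGTA.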
Most efficient algorithms (Navarro & Raffinot)
BNDM: Backward Nondeterministic Dawg Matching
BOM: Backward Oracle Matching
[Figure: the same map of the Horspool, BOM and BNDM regions. Horizontal axis: pattern length, 2 to 256, with the word size w marked. Vertical axis: 2 to 64.]
How does BOM make the comparison?
[Figure: a search window of the text aligned with the pattern.]
Automaton: factor oracle
How is the next position of the window determined?
It checks whether the suffix of the window read so far is a factor of the pattern.
But first let us look at how the comparison is made…
The automaton of the reversed pattern is built. Suppose the pattern is ATGTATG:
A T G T A T G
[Figure: factor oracle of the reversed pattern; edge labels G, T, A, G, T, T, A, G, T, A.]
And the search over the text:
G T A R T R N A A T G…
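The idea above can be sketched as follows: build the factor oracle of the reversed pattern, read each window backwards through it, and shift the window past the position where the oracle fails. This is a minimal sketch of the standard construction, assuming a text without class symbols; the names factor_oracle and bom_search are hypothetical.

```python
def factor_oracle(word):
    """Transitions of the factor oracle of word, one dict per state."""
    m = len(word)
    trans = [dict() for _ in range(m + 1)]
    supply = [-1] * (m + 1)  # supply links; -1 plays the role of "undefined"
    for i in range(1, m + 1):
        c = word[i - 1]
        trans[i - 1][c] = i  # spine transition
        k = supply[i - 1]
        while k > -1 and c not in trans[k]:
            trans[k][c] = i  # external transition
            k = supply[k]
        supply[i] = 0 if k == -1 else trans[k][c]
    return trans


def bom_search(pattern, text):
    """Offsets where pattern occurs in text, using backward oracle matching."""
    m = len(pattern)
    trans = factor_oracle(pattern[::-1])
    hits = []
    pos = 0
    while pos <= len(text) - m:
        state, j = 0, m
        # Read the window from right to left while the oracle accepts.
        while j > 0 and state is not None:
            state = trans[state].get(text[pos + j - 1])
            j -= 1
        if state is not None:  # the whole window was read: an occurrence
            hits.append(pos)
        pos += j + 1  # jump past the character where reading stopped
    return hits


if __name__ == "__main__":
    print(bom_search("ATGTATG", "GTATGTATGATGTATG"))
```

The oracle may accept some strings that are not factors of the pattern, but when it rejects a suffix of the window that suffix is certainly not a factor, so the shift is safe.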
No improvement is possible!
[Figure: maps of the most efficient multi-pattern algorithms (Wu-Manber, SBOM, Ad AC) for sets of 5, 10, 100 and 1000 words. Horizontal axis: minimum length, 5 to 45. Vertical axis: 2 to 8.]
Search for the patterns ATGTATG, TATG, ATAAT, ATGTG
in the text: ARTGNCTATGTGACA…
[Figure: automaton built from the set of patterns; edge labels G, T, A, T, A, T, G, G, T, A, T, A, A, T, A, A.]
A regular expression ℛ is a string over
Σ ∪ { ε, |, ·, *, (, ) } defined recursively as:
ε is a regular expression
A character of Σ is a regular expression
( ℛ ) is a regular expression
ℛ1 · ℛ2 is a regular expression
ℛ1 | ℛ2 is a regular expression
ℛ* is a regular expression
The language represented by a regular expression is the set of words that can be built from the regular expression.
The problem of searching for a regular expression in a text is that of finding all the factors that belong to the corresponding regular language.
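As an illustration of "all the factors that belong to the language", here is a brute-force sketch built on Python's standard re module, where union is written |, concatenation is implicit and * is the star. The function name regex_factors is hypothetical, and checking every factor is quadratic in the text length, so this is a definition check rather than an efficient search.

```python
import re


def regex_factors(regex, text):
    """All (start, end) pairs such that text[start:end] is in the language.

    Only non-empty factors are reported.
    """
    compiled = re.compile(regex)
    return [(i, j)
            for i in range(len(text))
            for j in range(i + 1, len(text) + 1)
            if compiled.fullmatch(text, i, j)]


if __name__ == "__main__":
    print(regex_factors("A(C|G)*T", "ACGTT"))
    print(regex_factors("(AT)*", "ATAT"))
```

Unlike re.finditer, which reports non-overlapping leftmost matches, this enumerates overlapping factors as well.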
For instance, given the sequence
CTACTACTACGTCTATACTGATCGTAGCTACTACATGC
search for the pattern ACTGA allowing one error…
… but what is the meaning of “one error”?
We accept three types of errors:
1. Mismatch: ACCGTGAT → ACCGAGAT
2. Insertion: ACCGTGAT → ACCGATGAT
3. Deletion: ACCGTGAT → ACCGGAT
An insertion or a deletion is called an indel.
The edit distance d between two strings is the
minimum number of
substitutions, insertions and deletions
needed to transform the first string into the second one.
d(ACT,ACT)=0 d(ACT,AC)=1 d(ACT,C)=2
d(ACT,)=3 d(AC,ATC)=1 d(ACTTG,ATCTG)=2
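The six distances above can be checked with a direct recursive sketch of the definition: the last step of any transformation is a substitution (or a free match), a deletion or an insertion. The function name edit_distance is hypothetical, and memoisation with lru_cache is a convenience rather than part of the definition.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def edit_distance(a, b):
    """Minimum number of substitutions, insertions and deletions from a to b."""
    if not a:
        return len(b)  # insert every character of b
    if not b:
        return len(a)  # delete every character of a
    substitution = edit_distance(a[:-1], b[:-1]) + (a[-1] != b[-1])
    deletion = edit_distance(a[:-1], b) + 1
    insertion = edit_distance(a, b[:-1]) + 1
    return min(substitution, deletion, insertion)


if __name__ == "__main__":
    for a, b in [("ACT", "ACT"), ("ACT", "AC"), ("ACT", "C"),
                 ("ACT", ""), ("AC", "ATC"), ("ACTTG", "ATCTG")]:
        print(f"d({a},{b}) = {edit_distance(a, b)}")
```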
The edit distance is related to the best alignment of the strings.
Given
d(ACT,ACT)=0 d(ACT,AC)=1 d(ACTTG,ATCTG)=2
which is the best alignment in each case?
ACT and AC:
ACT
AC-
ACTTG and ATCTG (two alignments, each with two errors):
ACTTG
ATCTG
ACT-TG
A-TCTG
But what is the distance between the strings
ACGCTATGCTATACG and ACGGTAGTGACGC?
… and the best alignment between them?
This problem was first discussed in 1966,
and the algorithm was proposed in 1968, 1970, …
using the technique called “dynamic programming”.
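  C T A C T A C T A C G T
A
C
T
G
A
Each cell contains the distance between a prefix of the pattern and a prefix of the text. For instance, the cell in row C and the fifth column contains the distance between AC and CTACT.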
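The first row holds the distance between the empty prefix of the pattern and each prefix of the text:
    C T A C T A C T A C G T
  0 1 2 3 4 5 6 7 8 …
A
C
T
G
A
For instance, the value 6 corresponds to the alignment
------
CTACTA
The first column holds the distance between each prefix of the pattern and the empty prefix of the text:
    C T A C T A C T A C G T
  0 1 2 3 4 5 6 7 8 …
A 1
C 2
T 3
G …
A
For instance, the value 3 corresponds to the alignment
ACT
---
The best alignment BA(AC,CTAC) is the best of three candidates:
BA(AC,CTA) followed by the column - over C
BA(A,CTA) followed by the column C over C
BA(A,CTAC) followed by the column C over -
so that
d(AC,CTAC) = min { d(AC,CTA) + 1, d(A,CTA), d(A,CTAC) + 1 }
(no cost is added to d(A,CTA) because the last characters, C and C, match).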
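The table-filling scheme above can be sketched as follows, with the pattern on the rows and the text on the columns. The names edit_matrix and best_alignment are hypothetical, and when several alignments are optimal the traceback simply prefers the diagonal move, which is an arbitrary choice.

```python
def edit_matrix(pattern, text):
    """d[i][j] = edit distance between pattern[:i] and text[:j]."""
    rows, cols = len(pattern) + 1, len(text) + 1
    d = [[0] * cols for _ in range(rows)]
    for j in range(cols):
        d[0][j] = j  # first row: 0 1 2 3 ...
    for i in range(rows):
        d[i][0] = i  # first column: 0 1 2 3 ...
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if pattern[i - 1] == text[j - 1] else 1
            d[i][j] = min(d[i][j - 1] + 1,         # gap over a text character
                          d[i - 1][j - 1] + cost,  # match or mismatch
                          d[i - 1][j] + 1)         # pattern character over gap
    return d


def best_alignment(pattern, text):
    """One optimal alignment, recovered by walking back through the matrix."""
    d = edit_matrix(pattern, text)
    i, j = len(pattern), len(text)
    top, bottom = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            cost = 0 if pattern[i - 1] == text[j - 1] else 1
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + cost:
            top.append(pattern[i - 1]); bottom.append(text[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            top.append(pattern[i - 1]); bottom.append("-")
            i -= 1
        else:
            top.append("-"); bottom.append(text[j - 1])
            j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    print(edit_matrix("AC", "CTAC")[2][4])
    print(*best_alignment("ACTTG", "ATCTG"), sep="\n")
```

The same functions can be applied to the pair ACGCTATGCTATACG and ACGGTAGTGACGC posed earlier.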
Connect to
http://alggen.lsi.upc.es/docencia/ember/leed/Tfc1.htm
and use the global method.
How can this algorithm be applied
to approximate search,
that is, to k-approximate string searching?
  C T A C T A C T A C G T A C T G G T G A A …
A
C
T
G
A
The cell in the last row under the prefix CT…GTA gives the distance between (ACTGA, CT…GTA)…
…but we are only interested in the last characters of the text.
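One common way to make the matrix care only about the last characters of the text is to fill the first row with zeros, so that an occurrence may start at any text position, and to report every column whose last cell is at most k. The sketch below assumes this adaptation, which the slides do not spell out; the name approximate_search is hypothetical, and only one column is kept in memory at a time.

```python
def approximate_search(pattern, text, k):
    """End offsets j such that some factor of text ending at j is within
    edit distance k of pattern."""
    m = len(pattern)
    column = list(range(m + 1))  # column for the empty text prefix
    ends = []
    for j, c in enumerate(text):
        previous = column
        column = [0] * (m + 1)  # first row is 0: a match may start anywhere
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == c else 1
            column[i] = min(previous[i] + 1,         # extra text character
                            previous[i - 1] + cost,  # match or mismatch
                            column[i - 1] + 1)       # missing pattern character
        if column[m] <= k:
            ends.append(j)
    return ends


if __name__ == "__main__":
    text = "CTACTACTACGTCTATACTGATCGTAGCTACTACATGC"
    print(approximate_search("ACTGA", text, 0))
    print(approximate_search("ACTGA", text, 1))
```

The function reports end positions only; the start of each occurrence would have to be recovered separately, for example by a traceback.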
Pairwise and multiple alignment
Edit distance:
match=0 mismatch=1 indel=1
d(AC,CTAC) = minimum { d(A,CTAC) + 1, d(A,CTA) + 0 (match; + 1 for a mismatch), d(AC,CTA) + 1 }
Similarity:
match=+1 mismatch=-1 indel=-2
s(AC,CTAC) = maximum { s(A,CTAC) - 2, s(A,CTA) + 1 (match; - 1 for a mismatch), s(AC,CTA) - 2 }
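The similarity recurrence has the same shape as the edit distance one, with a maximum instead of a minimum. The sketch below assumes the scores match = +1, mismatch = -1 and indel = -2, that is, it reads the mismatch and indel values as penalties; the function name similarity and its keyword parameters are hypothetical.

```python
def similarity(a, b, match=1, mismatch=-1, indel=-2):
    """Best global alignment score between a and b under the given scores."""
    rows, cols = len(a) + 1, len(b) + 1
    s = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        s[i][0] = i * indel  # a[:i] aligned against gaps only
    for j in range(1, cols):
        s[0][j] = j * indel  # b[:j] aligned against gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            s[i][j] = max(s[i - 1][j - 1] + pair,  # align the two characters
                          s[i - 1][j] + indel,     # a[i-1] against a gap
                          s[i][j - 1] + indel)     # b[j-1] against a gap
    return s[-1][-1]


if __name__ == "__main__":
    print(similarity("AC", "CTAC"))
    print(similarity("ACTTG", "ATCTG"))
```

With these scores an indel costs more than a mismatch, so for ACTTG and ATCTG the gap-free alignment scores higher than the one with two indels, although both count two errors under the edit distance.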
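[Figure: three-dimensional alignment matrix with one axis for each of the strings S1, S2 and S3.]
What happens with three strings?
Let n be their length; then the cost becomes
O(n^3 · 2^3 · 3^2)
that is, n^3 cells, about 2^3 neighbouring cells to inspect for each one, and about 3^2 pairs of characters to score for each neighbour.
And with k strings?
O(n^k · 2^k · k^2)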
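To get a feeling for how fast this bound grows, the expression can be evaluated for a few values of k. The sketch reads the bound as n^k · 2^k · k^2, with the exponents restored, and the function name alignment_cost is hypothetical; the result is an order of magnitude, not an exact operation count.

```python
def alignment_cost(n, k):
    """Order-of-magnitude cost of aligning k strings of length n exactly."""
    return n ** k * 2 ** k * k ** 2


if __name__ == "__main__":
    for k in range(2, 7):
        print(k, alignment_cost(100, k))
```

Each additional string multiplies the cost by roughly 2n, which is why multiple alignment programs rely on heuristics.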
Multiple alignment programs use different heuristics:
http://www.ebi.ac.uk/clustalw
http://igsserver.cnrsmrs.fr/Tcoffee_cgi/index.cgi