Approaches to Sequence Analysis
Data {GTCAT,GTTGGT,GTCA,CTCA}
Parsimony, similarity, optimisation.
TKF91: The combined substitution/indel process.
Acceleration of Basic Algorithm
Many Sequence Algorithm
MCMC Approaches
Statistical Alignment and Footprinting
Ideal Practice: 1 phase analysis.
Actual Practice: 2 phase analysis.
[Figure: sequences s1, s2, s3 and s4 with a "statistics" step; a sequence of links # evolving over time T = t; sequences s1 and s2 related through a root r.]
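Thorne-Kishino-Felsenstein (1991) Process
[Figure: a sequence of links, A # C G, preceded by the immortal link *.]
1. $P(s) = (1-\lambda/\mu)\,(\lambda/\mu)^{l}\;\pi_A^{\#A} \cdots \pi_T^{\#T}$, where $l = \mathrm{length}(s)$.
2. Time reversible.
[Figure: k links # # # # and their alignment to the descendant sequence, with and without the immortal link *.]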
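The equilibrium probability in item 1 can be evaluated directly. The following is a minimal sketch, assuming uniform nucleotide frequencies and illustrative values for λ and μ (neither is given on the slide); the function name is hypothetical.

```python
import math

def tkf91_equilibrium_logprob(seq, lam, mu, pi):
    """log P(s) = log(1 - lam/mu) + l*log(lam/mu) + sum of log pi over the letters."""
    if not 0 < lam < mu:
        raise ValueError("need 0 < lam < mu for a proper geometric length distribution")
    ratio = lam / mu
    length_term = math.log1p(-ratio) + len(seq) * math.log(ratio)
    composition_term = sum(math.log(pi[c]) for c in seq)
    return length_term + composition_term

# Assumed values for illustration only.
pi_uniform = {c: 0.25 for c in "ACGT"}
logp = tkf91_equilibrium_logprob("GTCAT", lam=0.03, mu=0.04, pi=pi_uniform)
print(logp)
```

Working in log space avoids underflow for long sequences; the geometric length term and the composition term simply add.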
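λ & μ into Alignment Blocks
A. Amino Acids Ignored:
[Figure: one ancestral link # (or the immortal link *) aligned to its k descendants.]
Link survives: $p_k(t) = e^{-\mu t}\,[1-\lambda\beta]\,(\lambda\beta)^{k-1}$
Link dies: $p'_k(t) = [1-e^{-\mu t}-\mu\beta]\,[1-\lambda\beta]\,(\lambda\beta)^{k-1}$ for $k \ge 1$, and $p'_0(t) = \mu\beta(t)$
Immortal link: $p''_k(t) = [1-\lambda\beta]\,(\lambda\beta)^{k}$
$\beta = [1-e^{(\lambda-\mu)t}] \,/\, [\mu-\lambda e^{(\lambda-\mu)t}]$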
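A quick way to check the p-functions is to confirm that they are probability distributions. The sketch below assumes the closed forms as published by Thorne, Kishino and Felsenstein (1991), with p'_k = [1 - e^{-μt} - μβ][1 - λβ](λβ)^{k-1} for k ≥ 1 and p''_k indexed from k = 0; the function names and the parameter values are illustrative.

```python
import math

def beta(lam, mu, t):
    x = math.exp((lam - mu) * t)
    return (1.0 - x) / (mu - lam * x)

def p_survive(k, lam, mu, t):
    """p_k(t): the link survives and has k >= 1 descendants (itself included)."""
    lb = lam * beta(lam, mu, t)
    return math.exp(-mu * t) * (1 - lb) * lb ** (k - 1)

def p_die(k, lam, mu, t):
    """p'_k(t): the link dies and leaves k >= 0 descendants."""
    b = beta(lam, mu, t)
    if k == 0:
        return mu * b
    lb = lam * b
    return (1 - math.exp(-mu * t) - mu * b) * (1 - lb) * lb ** (k - 1)

def p_immortal(k, lam, mu, t):
    """p''_k(t): the immortal link has k >= 0 new descendants."""
    lb = lam * beta(lam, mu, t)
    return (1 - lb) * lb ** k

# Assumed parameter values for illustration.
lam, mu, t = 1.0, 2.0, 1.0
mortal_total = (sum(p_survive(k, lam, mu, t) for k in range(1, 200))
                + sum(p_die(k, lam, mu, t) for k in range(0, 200)))
immortal_total = sum(p_immortal(k, lam, mu, t) for k in range(0, 200))
print(mortal_total, immortal_total)
```

Both totals should be 1 up to rounding: a mortal link either survives or dies, and the immortal link always has some number of descendants.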
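B. Amino Acids Considered:
[Example: an ancestral T aligned to the 4 descendants R Q S W, with R homologous to T and Q, S, W inserted.]
Probability: $P_t(T \to R)\,\pi_Q \cdots \pi_W\,p_4(t)$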
Differential Equations for p-functions
[Figure: a link with k descendants, reached from k-1 descendants by a birth or from k+1 descendants by a death; drawn for a surviving link #, a dying link, and the immortal link *.]
$\Delta p_k = \Delta t\,[\lambda(k-1)\,p_{k-1} + \mu k\,p_{k+1} - (\lambda+\mu)k\,p_k]$
$\Delta p'_k = \Delta t\,[\lambda(k-1)\,p'_{k-1} + \mu(k+1)\,p'_{k+1} - (\lambda+\mu)k\,p'_k + \mu\,p_{k+1}]$
$\Delta p''_k = \Delta t\,[\lambda k\,p''_{k-1} + \mu(k+1)\,p''_{k+1} - [(k+1)\lambda + k\mu]\,p''_k]$
Initial Conditions: $p_k(0) = p''_k(0) = p'_k(0) = 0$ for $k > 1$; $p_1(0) = p''_0(0) = 1$; $p'_0(0) = 0$.
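The difference equations can be integrated numerically and compared with the closed-form p-functions. This is a rough sketch using forward Euler steps with the index k truncated at an assumed kmax; the closed forms are taken from Thorne, Kishino and Felsenstein (1991), and all names and parameter values are illustrative.

```python
import math

def closed_forms(lam, mu, t):
    """Closed-form p_k (k >= 1), p'_k (k >= 0) and p''_k (k >= 0)."""
    x = math.exp((lam - mu) * t)
    b = (1 - x) / (mu - lam * x)
    lb, surv = lam * b, math.exp(-mu * t)
    p = lambda k: surv * (1 - lb) * lb ** (k - 1)
    pd = lambda k: mu * b if k == 0 else (1 - surv - mu * b) * (1 - lb) * lb ** (k - 1)
    pim = lambda k: (1 - lb) * lb ** k
    return p, pd, pim

def euler_p_functions(lam, mu, t, kmax=30, steps=5000):
    """Forward-Euler integration; entries above kmax are held at 0 (truncation)."""
    dt = t / steps
    size = kmax + 2
    p, pd, pim = [0.0] * size, [0.0] * size, [0.0] * size
    p[1] = 1.0      # p_1(0) = 1
    pim[0] = 1.0    # p''_0(0) = 1
    for _ in range(steps):
        dp, dpd, dpim = [0.0] * size, [0.0] * size, [0.0] * size
        for k in range(kmax + 1):
            pm, pdm, pimm = (p[k - 1], pd[k - 1], pim[k - 1]) if k > 0 else (0.0, 0.0, 0.0)
            dp[k] = lam * (k - 1) * pm + mu * k * p[k + 1] - (lam + mu) * k * p[k]
            dpd[k] = (lam * (k - 1) * pdm + mu * (k + 1) * pd[k + 1]
                      - (lam + mu) * k * pd[k] + mu * p[k + 1])
            dpim[k] = (lam * k * pimm + mu * (k + 1) * pim[k + 1]
                       - ((k + 1) * lam + k * mu) * pim[k])
        for k in range(kmax + 1):
            p[k] += dt * dp[k]
            pd[k] += dt * dpd[k]
            pim[k] += dt * dpim[k]
    return p, pd, pim

# Assumed parameter values for illustration.
lam, mu, t = 1.0, 2.0, 0.5
num_p, num_pd, num_pim = euler_p_functions(lam, mu, t)
cf_p, cf_pd, cf_pim = closed_forms(lam, mu, t)
for k in range(1, 4):
    print(k, num_p[k], cf_p(k), num_pd[k], cf_pd(k), num_pim[k], cf_pim(k))
```

Euler steps are only first-order accurate, so agreement is expected to a few decimal places rather than to machine precision.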
Basic Pairwise Recursion (O(length^3))
[Figure: the dynamic-programming cell (i,j) is filled from the cells (i-1,j), (i-1,j-1), …, (i-1,j-k), … of the previous row, according to the fate of link i of s1.]
Survives: link i leaves the last k = 1…j characters of s2[1:j] (j cases), each weighted by $e^{-\mu t}[1-\lambda\beta](\lambda\beta)^{k-1}$, where $\beta = [1-e^{(\lambda-\mu)t}]/[\mu-\lambda e^{(\lambda-\mu)t}]$.
Dies: link i leaves the last k = 0…j characters of s2[1:j] (j+1 cases).
Initial condition: p'' = s2[1:j]
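Fundamental Pairwise Recursion.
Notation: $s1_i = s1[1:i]$ and $s2_j = s2[1:j]$.
$P(s1_i \to s2_j) = p'_0\,P(s1_{i-1} \to s2_j) + \sum_{k=1}^{j} \bigl(p_k\,f(s1[i], s2[j-k+1]) + p'_k\,\pi_{s2[j-k+1]}\bigr)\,\pi_{s2[j-k+2:j]}\,P(s1_{i-1} \to s2_{j-k})$
Initial Condition: $P(s1_0 \to s2_j) = p''_j\,\pi_{s2[1:j]}$
Simplification: $R_{i,j} = \bigl(p_1\,f(s1[i], s2[j]) + p'_1\,\pi_{s2[j]}\bigr)\,P(s1_{i-1} \to s2_{j-1}) + \lambda\beta\,\pi_{s2[j]}\,R_{i,j-1}$
$P(s1_i \to s2_j) = R_{i,j} + p'_0\,P(s1_{i-1} \to s2_j)$
Eliminating R with $R_{i,j-1} = P(s1_i \to s2_{j-1}) - p'_0\,P(s1_{i-1} \to s2_{j-1})$ gives
$P(s1_i \to s2_j) = p'_0\,P(s1_{i-1} \to s2_j) + \lambda\beta\,\pi_{s2[j]}\,P(s1_i \to s2_{j-1}) + \bigl(p_1\,f(s1[i], s2[j]) + p'_1\,\pi_{s2[j]} - \lambda\beta\,\pi_{s2[j]}\,p'_0\bigr)\,P(s1_{i-1} \to s2_{j-1})$
Probability of observation: $P(s1, s2) = P(s1)\,P(s1 \to s2)$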
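The cubic sum and the quadratic simplification via R can be compared numerically. The sketch below is one possible reading of the recursion, not the authors' implementation. It assumes an F81-style substitution model for f (keep the letter with probability e^{-st}, otherwise redraw from π), uniform π, the TKF91 closed form for p'_1, and illustrative parameter values; all function names are hypothetical.

```python
import math

def f81(pi, s, t):
    """Assumed substitution model: f(a, b) = e^{-st} [a == b] + (1 - e^{-st}) pi[b]."""
    keep = math.exp(-s * t)
    return lambda a, b: keep * (a == b) + (1 - keep) * pi[b]

def equilibrium_prob(seq, lam, mu, pi):
    return (1 - lam / mu) * math.prod((lam / mu) * pi[c] for c in seq)

def tkf91_transition_prob(s1, s2, lam, mu, t, pi, f, simplified=True):
    """P(s1 -> s2) summed over all alignments, by the quadratic or the cubic recursion."""
    x = math.exp((lam - mu) * t)
    lb = lam * (1 - x) / (mu - lam * x)             # lambda * beta
    mb = mu * (1 - x) / (mu - lam * x)              # mu * beta = p'_0
    p1 = math.exp(-mu * t) * (1 - lb)               # p_1
    pd1 = (1 - math.exp(-mu * t) - mb) * (1 - lb)   # p'_1
    n, m = len(s1), len(s2)
    P = [[0.0] * (m + 1) for _ in range(n + 1)]
    prod = 1.0
    for j in range(m + 1):                          # row 0: p''_j * pi(s2[1:j])
        if j > 0:
            prod *= pi[s2[j - 1]]
        P[0][j] = (1 - lb) * lb ** j * prod
    for i in range(1, n + 1):
        a = s1[i - 1]
        R = 0.0                                     # R_{i,0} = 0
        P[i][0] = mb * P[i - 1][0]
        for j in range(1, m + 1):
            if simplified:
                c = s2[j - 1]
                R = (p1 * f(a, c) + pd1 * pi[c]) * P[i - 1][j - 1] + lb * pi[c] * R
            else:
                R, inserted = 0.0, 1.0              # explicit sum over k = 1..j
                for k in range(1, j + 1):
                    c = s2[j - k]                   # first of the k characters left by link i
                    R += ((p1 * f(a, c) + pd1 * pi[c]) * lb ** (k - 1)
                          * inserted * P[i - 1][j - k])
                    inserted *= pi[c]
            P[i][j] = R + mb * P[i - 1][j]
    return P[n][m]

# Assumed parameter values; the sequences are taken from the data slide.
pi = {c: 0.25 for c in "ACGT"}
lam, mu, s, t = 0.03, 0.04, 0.9, 1.0
f = f81(pi, s, t)
quadratic = tkf91_transition_prob("GTCAT", "GTTGGT", lam, mu, t, pi, f, simplified=True)
cubic = tkf91_transition_prob("GTCAT", "GTTGGT", lam, mu, t, pi, f, simplified=False)
joint = equilibrium_prob("GTCAT", lam, mu, pi) * quadratic
print(quadratic, cubic, joint)
```

Because the process is time reversible, the joint probability should come out the same when the roles of the two sequences are swapped, which gives a useful sanity check on any implementation.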
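Markov Chains Generating the p-functions
Steel and Hein, 2001 + Holmes and Bruno, 2001
[Figure: small Markov chains that emit links # until they reach the end state E.]
Equilibrium sequence (* # # # #): another # with probability λ/μ, end with probability 1 - λ/μ.
p'' function generator (immortal link *): another # with probability λβ, end with probability 1 - λβ.
p'/p function generator (mortal link #): first step with probabilities 1 - μβ and μβ, then another # with probability λβ, end with probability 1 - λβ.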
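An HMM Generating Alignments
States are alignment columns written ancestor/descendant: −/# (insertion), #/# (match), #/− (deletion), plus the start state * and the end state E.
Transition probabilities out of *, #/# and −/# (the same for all three):
to −/#: $\lambda\beta$
to #/#: $(\lambda/\mu)\,(1-\lambda\beta)\,e^{-\mu t}$
to #/−: $(\lambda/\mu)\,(1-\lambda\beta)\,(1-e^{-\mu t})$
to E: $(1-\lambda/\mu)\,(1-\lambda\beta)$
Out of #/−: λβ
Emit functions:
e(#/#) = π(N1) f(N1,N2)
e(#/−) = π(N1), e(−/#) = π(N2)
π(N1): equilibrium probability of N1
f(N1,N2): probability that N1 evolves into N2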
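The transition row shared by the start, match and insertion states should sum to one. A small sketch that builds this row, using as inputs the λ·t and μ·t estimates quoted for the globin pair and the assumption t = 1; the function name and the state labels are illustrative.

```python
import math

def tkf91_hmm_row(lam, mu, t):
    """Transition probabilities out of the start, match and insertion states."""
    x = math.exp((lam - mu) * t)
    lb = lam * (1 - x) / (mu - lam * x)   # lambda * beta
    surv = math.exp(-mu * t)
    return {
        "insert (-/#)": lb,
        "match (#/#)": (lam / mu) * (1 - lb) * surv,
        "delete (#/-)": (lam / mu) * (1 - lb) * (1 - surv),
        "end (E)": (1 - lam / mu) * (1 - lb),
    }

# lam*t and mu*t as estimated for the globin pair; t = 1 is an assumption.
row = tkf91_hmm_row(lam=0.0371805, mu=0.0374396, t=1.0)
for state, prob in row.items():
    print(f"{state:14s} {prob:.6f}")
print("total", sum(row.values()))
```

With λ close to μ, as in these estimates, the end probability is small and the chain generates long alignments.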
Acceleration of Pairwise Algorithm
(From Hein, Wiuf, Knudsen, Moeller & Wiebling 2000)
Better numerical search: ~10-100
Ex.: good start guess, 28 evaluations, 3 iterations
Simpler recursion: ~3-10
Faster computers: ~250
1991 → 2000: ~10^6
α-globin (141) and β-globin (146)
(From Hein, Wiuf, Knudsen, Moeller & Wiebling 2000)
-430.108 : log(α-globin)
-327.320 : log(α-globin → β-globin)
-747.428 : log(α-globin, β-globin) = log(l(sumalign))
λ·t: 0.0371805 ± 0.0135899
μ·t: 0.0374396 ± 0.0136846
s·t: 0.91701 ± 0.119556
E(Length)    E(Insertions, Deletions)    E(Substitutions)
143.499      5.37255                     131.59
Maximum contributing alignment:
VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALT
VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFS
NAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
DGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH
Ratio l(maxalign)/l(sumalign) = 0.00565064
The invasion of the immortal link
(From Hein, Wiuf, Knudsen, Moeller & Wiebling 2000)
[Figure: the sequence VLSPADNAL.....DLHAHKR, 141 AA long, drawn as links *########### …. ### and followed over 2·10^7, 2·10^8 and 2·10^9 years; after 10^9 years it is shown as ????????????????????, k AA long.]
α-globin, myoglobin homology tests
D(s1,s2) is evaluated in D(s1,s2*)
Real: s1 = ATWYFCAKAC
      s2 = ETWYKCALLAD
      *** ** *
Random: s1 = ATWYFCAKAC
        s2* = LTAYKADCWLE
        *
$W_{i,j} = \ln\bigl(\pi_i\,P^{2.5}_{i,j} \,/\, (\pi_i\,\pi_j)\bigr)$
1. Tests the competing hypotheses that 2 sequences are 2.5 events apart versus infinitely far apart.
2. It only handles substitutions “correctly”. The rationale for indel costs is more arbitrary.
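The log-odds weight W can be illustrated with a toy substitution model. The sketch below assumes a uniform composition over the 20 amino acids and an F81-style model in which, after distance d, a letter is unchanged with probability e^{-d} and otherwise redrawn from π; these choices and the function names are assumptions. The sequences are compared position by position without gaps, which need not be the alignment drawn on the slide.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def log_odds_matrix(pi, distance=2.5):
    """W[i, j] = ln(pi_i * P_ij(distance) / (pi_i * pi_j)) under the assumed model."""
    keep = math.exp(-distance)
    W = {}
    for i in pi:
        for j in pi:
            p_ij = keep * (i == j) + (1 - keep) * pi[j]
            W[i, j] = math.log(pi[i] * p_ij / (pi[i] * pi[j]))
    return W

def ungapped_score(a, b, W):
    """Sum of weights over paired positions; zip stops at the shorter sequence."""
    return sum(W[x, y] for x, y in zip(a, b))

pi = {aa: 1 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}   # assumed uniform composition
W = log_odds_matrix(pi)
real = ungapped_score("ATWYFCAKAC", "ETWYKCALLAD", W)
rand = ungapped_score("ATWYFCAKAC", "LTAYKADCWLE", W)
print(real, rand)
```

Identical pairs get a positive weight and mismatches a negative one, so the score is a log likelihood ratio of "2.5 events apart" against "unrelated" for the paired positions.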
Goodness-of-fit of TKF91
Sample random alignments from real sequences
Sample random alignments from random sequences
[Figure: sampled alignments of the sequence cgtgttacatatatatagccgatagccg.]
Compare real and random distribution using the chi-square statistic.
Algorithm for alignment on star tree (O(length^6)) (Steel & Hein, 2001)
[Figure: star tree with ancestor a and leaves s1, s2, s3; leaf sequences *ACGC, *TT GT and *ACG GT; ancestral links *###### with equilibrium factor (λ/μ).]
[Figure: four-leaf tree with inner nodes a1, a2 and leaves s1, s2, s3, s4 carrying the sequences TGA, ACCT, GTT and ACG; the alignment columns of a1 and a2 are drawn as * and # symbols.]
How to sum over all possible ancestral sequences and their alignments?
The problem would be simpler if:
ii. The alignment of ancestral alignment columns to leaf sequences was known
A Markov chain generating ancestral alignments can solve the problem!!
Generating Ancestral Alignments
[Figure: the alignment HMM applied to the ancestral sequences a1 and a2, starting in */*. Columns are written a1/a2.]
to −/#: $\lambda\beta$
to #/#: $(\lambda/\mu)\,(1-\lambda\beta)\,e^{-\mu t}$
to #/−: $(\lambda/\mu)\,(1-\lambda\beta)\,(1-e^{-\mu t})$
to E/E: $(1-\lambda/\mu)\,(1-\lambda\beta)$
Sequence Recursion: First Step Removal
“Remove 1st step” recursion: [figure: first column removed after the start state S]
“Remove last step” recursion: [figure: last column removed before the end state E]
Last/First step removal are inequivalent, but have the same complexities. First step algorithm is the simplest.
$P_a(S_k)$: probability of the epifixes (S[k+1:l]), given that the MC starts in a.
Where P’(kS i,H) = F(kSi,H)
Contrasting Probability versus Distance Recursions
Probability:
Distance (Sankoff, 1973):
[Figure: the two recursions drawn on alignment columns of nucleotides (A, C) and links (#); 15 cases.]
Maximum likelihood phylogeny and alignment
Gerton Lunter, Istvan Miklos, Alexei Drummond, Yun Song
Sequences: Human alpha hemoglobin; Human beta hemoglobin; Human myoglobin; Bean leghemoglobin
Probability of data: $e^{-1560.138}$
Probability of data and alignment: $e^{-1593.223}$
Probability of alignment given data: $4.279 \times 10^{-15} = e^{-33.085}$
Ratio of insertion-deletions to substitutions: 0.0334
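The three probabilities are linked by P(alignment | data) = P(data, alignment) / P(data), which is a subtraction in log space. A short check of the quoted numbers (the variable names are illustrative):

```python
import math

log_p_data = -1560.138                 # log P(data)
log_p_data_and_alignment = -1593.223   # log P(data, alignment)

# Conditional probability of the alignment given the data, computed in log space.
log_p_alignment_given_data = log_p_data_and_alignment - log_p_data
p_alignment_given_data = math.exp(log_p_alignment_given_data)

print(round(log_p_alignment_given_data, 3))
print(p_alignment_given_data)
```

The tiny conditional probability shows how little of the total probability mass a single alignment carries, even the best one.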
Gibbs Samplers for Statistical Alignment
Holmes & Bruno (2001):
Sampling Ancestors to pairs.
Jensen & Hein (in press):
Sampling nodes adjacent to triples
Slower basic operation, faster mixing
As in Drummond et al. 2002
Metropolis-Hastings Statistical Alignment.
Lunter, Drummond, Miklos, Jensen & Hein, 2005
The alignment moves:
1. We choose a random window in the current alignment.
2. Then delete all gaps so we get back subsequences.
3. Stochastically realign this part.
[Figure: the current alignment, shown once per step:]
QSTQCCS
SCCS
QSTQC
QSTQC
TNQHVSCTGN
GNHVSCTGK
TNQHSCTLN
TNQHVSCTLN
ALITLGG
ALLTLTTLGG
TLTSLGA
ALLGLTSLGA
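The window move can be sketched as a proposal function. This is only an illustration of the three steps. The stochastic realignment here is a naive stand-in that scatters residues over random columns, not the proposal distribution of Lunter et al., and no Metropolis-Hastings acceptance step is included. The gapped input alignment is hypothetical, since the gaps of the slide's alignment are not available.

```python
import random

def propose_window_realignment(alignment, rng):
    """One window-move proposal on a list of equal-length gapped rows ('-' = gap)."""
    width = len(alignment[0])
    # Step 1: choose a random window, columns start..end-1.
    start = rng.randrange(width)
    end = rng.randrange(start + 1, width + 1)
    # Step 2: delete all gaps in the window to get back the subsequences.
    pieces = [row[start:end].replace("-", "") for row in alignment]
    new_width = max(len(piece) for piece in pieces)
    # Step 3 (stand-in): place each piece, in order, on randomly chosen columns.
    realigned = []
    for piece in pieces:
        columns = sorted(rng.sample(range(new_width), len(piece)))
        row = ["-"] * new_width
        for col, residue in zip(columns, piece):
            row[col] = residue
        realigned.append("".join(row))
    return [row[:start] + mid + row[end:] for row, mid in zip(alignment, realigned)]

# Hypothetical gapped arrangement of four of the sequences.
current = ["QSTQCCS", "---SCCS", "QSTQC--", "QSTQC--"]
rng = random.Random(0)
proposal = propose_window_realignment(current, rng)
for row in proposal:
    print(row)
```

The move never changes the underlying sequences, only where the gaps sit inside the window; a full sampler would accept or reject each proposal using the ratio of alignment probabilities and proposal probabilities.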
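[Figure: k sequences (rows 1…k) by n positions (columns 1…n); an HMM along the positions switches between a slow state with rate r_s and a fast state with rate r_f; example column of nucleotides A, C, G, T.]
Many unaligned sequences related by a known phylogeny:
[Figure: unaligned sequences such as ATG, AC and A on the phylogeny; an HMM generates their alignment columns of nucleotides, links # and gaps.]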
The Basics of Footprinting II
Statistical Alignment and Footprinting.
[Figure: sequences 1…k, e.g. acgtttgaaccgag, annotated by combining an Alignment HMM with a Signal HMM.]
Comment: The AHMM * SHMM is an approximate approach, as the SHMM does not include an evolutionary model.
Ex.: [Figure: Signal HMM with states S and F and transition labels 0.9 and 0.1; combined with the Alignment HMM it has the pair states SF, FS, SS, FF and states of the form (A,S); example state path F F S S F over the columns nnnnnnnnnnn.]
Alignment HMM ? Structure HMM
“Structure” does not stem from an evolutionary model:
using the HMM at the alignment will give other distributions on the leaves
using the HMM at the root will give other distributions on the leaves
[Figure: HMM with states Start, M1, M2, M3, Stop.]
Previously identified binding sites indicated by colored boxes
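The idea of annotating alignment columns with a two-state HMM can be sketched with the forward-backward algorithm. The sketch assumes that S and F stand for the slow and fast states, that each state is kept with probability 0.9 and left with probability 0.1, and that per-column likelihoods under each state are already available; the column likelihoods below are made-up numbers, and the function name is hypothetical.

```python
def posterior_slow(col_lik_slow, col_lik_fast, stay=0.9):
    """Posterior probability of the slow state for each column (forward-backward)."""
    n = len(col_lik_slow)
    emit = list(zip(col_lik_slow, col_lik_fast))        # emit[i][state], 0 = S, 1 = F
    trans = [[stay, 1 - stay], [1 - stay, stay]]        # trans[from][to]
    # Forward pass with a uniform start distribution (an assumption).
    fwd = [[0.5 * emit[0][0], 0.5 * emit[0][1]]]
    for i in range(1, n):
        prev = fwd[-1]
        fwd.append([emit[i][s] * sum(prev[r] * trans[r][s] for r in (0, 1))
                    for s in (0, 1)])
    # Backward pass.
    bwd = [[1.0, 1.0] for _ in range(n)]
    for i in range(n - 2, -1, -1):
        bwd[i] = [sum(trans[s][r] * emit[i + 1][r] * bwd[i + 1][r] for r in (0, 1))
                  for s in (0, 1)]
    total = sum(fwd[-1])
    return [fwd[i][0] * bwd[i][0] / total for i in range(n)]

# Made-up column likelihoods: the first three columns look conserved (slow).
slow_lik = [0.9, 0.9, 0.9, 0.1, 0.1]
fast_lik = [0.1, 0.1, 0.1, 0.9, 0.9]
posteriors = posterior_slow(slow_lik, fast_lik)
print([round(p, 3) for p in posteriors])
```

For long alignments the forward and backward values should be rescaled or kept in log space to avoid underflow; that is omitted here for brevity.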
References Statistical Alignment
TKF92, Long Indel, Explain HMM, Multiple Recursion, Hidden State Space, 1-state recursion and other reductions, competing algorithms