Approaches to Sequence Analysis

Presentation Transcript
slide1

Approaches to Sequence Analysis

Data {GTCAT,GTTGGT,GTCA,CTCA}

Parsimony, similarity, optimisation.

TKF91 - The combined substitution/indel process.

Acceleration of Basic Algorithm

Many Sequence Algorithm

MCMC Approaches

Statistical Alignment and Footprinting

GT-CAT
GTTGGT
GT-CA-
CT-CA-

Ideal Practice: 1 phase analysis.

Actual Practice: 2 phase analysis.

[Figure labels: statistics; s1, s2, s3, s4]

slide2

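Thorne-Kishino-Felsenstein (1991) Process

[Figure: the links (#) of a sequence, e.g. A # C G with the immortal link (*) at its left end, evolving by births and deaths from T = 0 to T = t; shown for sequence pairs s1, s2.]

  • λ (birth rate) < μ (death rate)

1. $P(s) = (1-\lambda/\mu)\,(\lambda/\mu)^{l}\,\pi_A^{\#A} \cdots \pi_T^{\#T}$, where $l = \mathrm{length}(s)$

2. Time reversible: $P(s_1)\,P_t(s_1 \to s_2) = P(s_2)\,P_t(s_2 \to s_1)$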
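The equilibrium distribution in point 1 can be checked numerically. The following is a minimal sketch, not the authors' code: the rates `lam = 1.0`, `mu = 2.0` and the uniform residue frequencies are illustrative assumptions, and the function names are hypothetical.

```python
def length_probability(length, lam, mu):
    """Geometric equilibrium length distribution (1 - lam/mu) * (lam/mu)**length."""
    if not lam < mu:
        raise ValueError("TKF91 requires birth rate < death rate")
    ratio = lam / mu
    return (1 - ratio) * ratio ** length


def equilibrium_probability(seq, lam, mu, pi):
    """P(s): length term times the equilibrium frequency of every residue."""
    prob = length_probability(len(seq), lam, mu)
    for residue in seq:
        prob *= pi[residue]
    return prob


# Illustrative values only (assumptions, not taken from the slides).
lam, mu = 1.0, 2.0
uniform = {x: 0.25 for x in "ACGT"}

p_gtcat = equilibrium_probability("GTCAT", lam, mu, uniform)
# The length distribution sums to 1 and has mean (lam/mu) / (1 - lam/mu).
total = sum(length_probability(n, lam, mu) for n in range(200))
mean_length = sum(n * length_probability(n, lam, mu) for n in range(200))
```

With λ close to μ, as in the globin estimates later in the slides, the expected length (λ/μ)/(1 − λ/μ) becomes large, which is how the model accommodates long sequences.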

slide3

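λ & μ into Alignment Blocks

A. Amino Acids Ignored:

# - - -
# # # #
k

A link that survives and has k descendants in total:
$p_k(t) = e^{-\mu t}\,[1-\lambda\beta]\,(\lambda\beta)^{k-1}$

# - - - -
- # # # #
k

A link that dies and leaves k descendants:
$p'_k(t) = [1-e^{-\mu t}-\mu\beta]\,[1-\lambda\beta]\,(\lambda\beta)^{k-1}$ for $k \ge 1$, and $p'_0(t) = \mu\beta(t)$

* - - - -
* # # # #
k

The immortal link with k descendants:
$p''_k(t) = [1-\lambda\beta]\,(\lambda\beta)^{k}$

where $\beta = \dfrac{1-e^{(\lambda-\mu)t}}{\mu-\lambda e^{(\lambda-\mu)t}}$

B. Amino Acids Considered:

T - - -
R Q S W
4

$P_t(T \to R)\,\pi_Q \cdots \pi_W\,p_4(t)$

T - - - -
- R Q S W
4

$\pi_R\,\pi_Q \cdots \pi_W\,p'_4(t)$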
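A short numerical check of the p-functions may help here. This is a sketch under stated assumptions: p'_k for k ≥ 1 is taken in the TKF91 form [1 − e^{−μt} − μβ][1 − λβ](λβ)^{k−1}, the parameter values are illustrative, and the function names are hypothetical. The fate of a mortal link (p and p' together) and of the immortal link (p'') should each sum to 1.

```python
import math


def beta(lam, mu, t):
    """beta(t) = [1 - e^((lam-mu)t)] / [mu - lam * e^((lam-mu)t)]."""
    e = math.exp((lam - mu) * t)
    return (1 - e) / (mu - lam * e)


def p_survive(k, lam, mu, t):
    """p_k: the link survives and has k >= 1 descendants (itself included)."""
    lb = lam * beta(lam, mu, t)
    return math.exp(-mu * t) * (1 - lb) * lb ** (k - 1)


def p_die(k, lam, mu, t):
    """p'_k: the link dies and leaves k >= 0 descendants."""
    b = beta(lam, mu, t)
    if k == 0:
        return mu * b
    lb = lam * b
    return (1 - math.exp(-mu * t) - mu * b) * (1 - lb) * lb ** (k - 1)


def p_immortal(k, lam, mu, t):
    """p''_k: the immortal link has k >= 0 descendants."""
    lb = lam * beta(lam, mu, t)
    return (1 - lb) * lb ** k


# Illustrative rates and time (assumptions).
lam, mu, t = 1.0, 2.0, 1.0
mortal_total = (sum(p_survive(k, lam, mu, t) for k in range(1, 200))
                + sum(p_die(k, lam, mu, t) for k in range(0, 200)))
immortal_total = sum(p_immortal(k, lam, mu, t) for k in range(0, 200))
```

Because p_k and p'_k share the geometric factor (λβ)^{k−1}, every additional inserted residue costs the same factor λβ, which is what later allows the cubic recursion to be reduced.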
slide4

One block derivation

[Figure: a block in which the link # has k descendants (probability pk) is entered from a block with k-1 descendants by one birth, or from a block with k+1 descendants by one death.]

$\Delta p_k = \Delta t\,[\lambda (k-1)\,p_{k-1} + \mu k\,p_{k+1} - (\lambda+\mu)\,k\,p_k]$

slide5

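Differential Equations for p-functions

# - - ... -
# # # ... #

$\Delta p_k = \Delta t\,[\lambda (k-1)\,p_{k-1} + \mu k\,p_{k+1} - (\lambda+\mu)\,k\,p_k]$

# - - - ... -
- # # # ... #

$\Delta p'_k = \Delta t\,[\lambda (k-1)\,p'_{k-1} + \mu (k+1)\,p'_{k+1} - (\lambda+\mu)\,k\,p'_k + \mu\,p_{k+1}]$

* - - - ... -
* # # # ... #

$\Delta p''_k = \Delta t\,[\lambda k\,p''_{k-1} + \mu (k+1)\,p''_{k+1} - ((k+1)\lambda + k\mu)\,p''_k]$

Initial Conditions: $p_k(0) = 0$ for $k > 1$; $p'_k(0) = 0$ for all $k \ge 0$; $p''_k(0) = 0$ for $k \ge 1$;

$p_1(0) = p''_0(0) = 1$.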
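The difference equations can be iterated directly and compared with the closed-form p-functions. The sketch below makes several assumptions that are not in the slides: plain Euler steps, truncation of the system at `kmax` descendants, illustrative rates, and the TKF91 closed form for p'_1. All names are hypothetical.

```python
import math


def closed_form(lam, mu, t):
    """Closed-form p_1, p'_0, p'_1 and p''_0 (TKF91)."""
    e = math.exp((lam - mu) * t)
    beta = (1 - e) / (mu - lam * e)
    stop = 1 - lam * beta
    return {"p1": math.exp(-mu * t) * stop,
            "p'0": mu * beta,
            "p'1": (1 - math.exp(-mu * t) - mu * beta) * stop,
            "p''0": stop}


def euler(lam, mu, t, steps=5000, kmax=30):
    """Iterate the three difference equations with step Dt = t / steps."""
    dt = t / steps
    size = kmax + 2                  # entry kmax+1 stays 0 (truncation)
    p, pd, pim = [0.0] * size, [0.0] * size, [0.0] * size
    p[1] = 1.0                       # p_1(0) = 1
    pim[0] = 1.0                     # p''_0(0) = 1
    for _ in range(steps):
        new_p, new_pd, new_pim = p[:], pd[:], pim[:]
        for k in range(kmax + 1):
            below_pd = pd[k - 1] if k >= 1 else 0.0
            below_pim = pim[k - 1] if k >= 1 else 0.0
            if k >= 1:
                new_p[k] += dt * (lam * (k - 1) * p[k - 1]
                                  + mu * k * p[k + 1]
                                  - (lam + mu) * k * p[k])
            new_pd[k] += dt * (lam * (k - 1) * below_pd
                               + mu * (k + 1) * pd[k + 1]
                               - (lam + mu) * k * pd[k]
                               + mu * p[k + 1])
            new_pim[k] += dt * (lam * k * below_pim
                                + mu * (k + 1) * pim[k + 1]
                                - ((k + 1) * lam + k * mu) * pim[k])
        p, pd, pim = new_p, new_pd, new_pim
    return {"p1": p[1], "p'0": pd[0], "p'1": pd[1], "p''0": pim[0]}


# Illustrative rates and time (assumptions).
exact = closed_form(1.0, 2.0, 1.0)
approx = euler(1.0, 2.0, 1.0)
```

Euler integration is only first-order accurate, so agreement is expected to a few decimal places; a higher-order integrator would tighten it.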

slide6

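Basic Pairwise Recursion (O(length^3))

[Figure: the fate of residue i of s1 when s1[1..i] is aligned with s2[1..j]. Survives: residue i is matched to a residue of s2 and followed by insertions up to j (cases 1…j, j cases). Dies: residue i leaves 0…j descendants (j+1 cases).]

Each case is weighted by a p-function, e.g. $p_k(t) = e^{-\mu t}\,[1-\lambda\beta]\,(\lambda\beta)^{k-1}$, where

$\beta = \dfrac{1-e^{(\lambda-\mu)t}}{\mu-\lambda e^{(\lambda-\mu)t}}$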

slide7

Basic Pairwise Recursion (O(length^3))

[Figure: dynamic programming matrix. Cell (i,j) in row i is computed from the cells (i-1,j), (i-1,j-1), …, (i-1,j-k), … of row i-1, covering the survive and death cases.]

Initial condition: p'' = s2[1:j]

slide8

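Fundamental Pairwise Recursion.

Here $s1_i$ denotes the first i residues of s1, and $\pi$ the equilibrium frequencies.

$P(s1_i \to s2_j) = p'_0\,P(s1_{i-1} \to s2_j) + \sum_{k=1}^{j} \big(p_k\,f(s1[i],s2[j-k+1]) + p'_k\,\pi_{s2[j-k+1]}\big)\,\pi_{s2[j-k+2:j]}\,P(s1_{i-1} \to s2_{j-k})$

Initial Condition: $P(s1_0 \to s2_j) = p''_j\,\pi_{s2[1:j]}$

Simplification: $R_{i,j} = \big(p_1\,f(s1[i],s2[j]) + p'_1\,\pi_{s2[j]}\big)\,P(s1_{i-1} \to s2_{j-1}) + \lambda\beta\,\pi_{s2[j]}\,R_{i,j-1}$

$P(s1_i \to s2_j) = R_{i,j} + p'_0\,P(s1_{i-1} \to s2_j)$

Substituting $R_{i,j-1} = P(s1_i \to s2_{j-1}) - p'_0\,P(s1_{i-1} \to s2_{j-1})$ gives

$P(s1_i \to s2_j) = p'_0\,P(s1_{i-1} \to s2_j) + \lambda\beta\,\pi_{s2[j]}\,P(s1_i \to s2_{j-1}) + \big(p_1\,f(s1[i],s2[j]) + p'_1\,\pi_{s2[j]} - \lambda\beta\,p'_0\,\pi_{s2[j]}\big)\,P(s1_{i-1} \to s2_{j-1})$

Probability of observation: $P(s1,s2) = P(s1)\,P(s1 \to s2)$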
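To make the two recursions concrete, here is a minimal sketch that evaluates P(s1 → s2) both with the cubic sum over block sizes and with the simplified three-term form, so the two can be compared. Everything beyond the formulas is an assumption: the substitution model `f` (equal-input, F81-like), the frequencies, the rates, and all function names are illustrative and not taken from the slides.

```python
import math


def tkf91_constants(lam, mu, t):
    """Return (lambda*beta, p_1, p'_1, p'_0)."""
    e = math.exp((lam - mu) * t)
    beta = (1 - e) / (mu - lam * e)
    lb = lam * beta
    p1 = math.exp(-mu * t) * (1 - lb)
    p1_dead = (1 - math.exp(-mu * t) - mu * beta) * (1 - lb)
    return lb, p1, p1_dead, mu * beta


def first_row(s2, lb, pi):
    """Initial condition P(s1_0 -> s2_j) = p''_j * pi(s2[1:j]), j = 0..len(s2)."""
    row, weight = [], 1 - lb
    for j in range(len(s2) + 1):
        row.append(weight)
        if j < len(s2):
            weight *= lb * pi[s2[j]]
    return row


def transition_cubic(s1, s2, lam, mu, t, pi, f):
    """Sum over the number k of s2 residues attributed to s1[i]."""
    lb, p1, p1_dead, p0_dead = tkf91_constants(lam, mu, t)
    table = [first_row(s2, lb, pi)]
    for i in range(1, len(s1) + 1):
        a, row = s1[i - 1], []
        for j in range(len(s2) + 1):
            total = p0_dead * table[i - 1][j]
            tail = 1.0  # (lambda*beta)^(k-1) times pi of the k-1 later residues
            for k in range(1, j + 1):
                b = s2[j - k]
                total += (p1 * f(a, b) + p1_dead * pi[b]) * tail * table[i - 1][j - k]
                tail *= lb * pi[b]
            row.append(total)
        table.append(row)
    return table[-1][-1]


def transition_quadratic(s1, s2, lam, mu, t, pi, f):
    """Three-term recursion obtained by eliminating R."""
    lb, p1, p1_dead, p0_dead = tkf91_constants(lam, mu, t)
    table = [first_row(s2, lb, pi)]
    for i in range(1, len(s1) + 1):
        a = s1[i - 1]
        row = [p0_dead * table[i - 1][0]]
        for j in range(1, len(s2) + 1):
            b = s2[j - 1]
            diag = p1 * f(a, b) + p1_dead * pi[b] - lb * p0_dead * pi[b]
            row.append(p0_dead * table[i - 1][j] + lb * pi[b] * row[j - 1]
                       + diag * table[i - 1][j - 1])
        table.append(row)
    return table[-1][-1]


def joint(s1, s2, lam, mu, t, pi, f):
    """P(s1, s2) = P(s1) * P(s1 -> s2)."""
    prob = (1 - lam / mu) * (lam / mu) ** len(s1)
    for x in s1:
        prob *= pi[x]
    return prob * transition_quadratic(s1, s2, lam, mu, t, pi, f)


# Illustrative parameters (assumptions).
pi = {"A": 0.1, "C": 0.2, "G": 0.3, "T": 0.4}
lam, mu, t, s = 0.03, 0.04, 1.0, 0.9
keep = math.exp(-s * t)


def f(a, b):
    """Assumed substitution model: unchanged with prob. keep, else redrawn from pi."""
    return keep * (a == b) + (1 - keep) * pi[b]


cubic = transition_cubic("GTCAT", "GTTGGT", lam, mu, t, pi, f)
quadratic = transition_quadratic("GTCAT", "GTTGGT", lam, mu, t, pi, f)
```

Because the assumed `f` is reversible with respect to `pi`, `joint(s1, s2, ...)` and `joint(s2, s1, ...)` should agree, which gives a cheap check on any implementation.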

slide9

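Markov Chains Generating the p-functions

Ancestral Sequence Generator (* # # # #): from * or #, emit another # with probability λ/μ, or end (E) with probability 1 - λ/μ.

p'' function generator (* - - - - over * # # # #): from *, or from an inserted residue (- over #), insert another residue with probability λβ, or end (E) with probability 1 - λβ.

p'/p function generator (# - - - - over # # # # #, or # - - - - over - # # # #): the link # leaves the chain with probabilities 1 - μβ and μβ; each inserted residue (- over #) is followed by another insertion with probability λβ, or by the end (E) with probability 1 - λβ.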

slide10

Statistical Alignment

Steel and Hein, 2001 + Holmes and Bruno, 2001

An HMM Generating Alignments

States are alignment columns (s1 over s2): -/# (insertion), #/# (match), #/- (deletion), with start */* and end E.

Transition probabilities from */*, #/# and -/#:

  to -/#:  $\lambda\beta$
  to #/#:  $\frac{\lambda}{\mu}\,(1-\lambda\beta)\,e^{-\mu t}$
  to #/-:  $\frac{\lambda}{\mu}\,(1-\lambda\beta)\,(1-e^{-\mu t})$
  to E:    $(1-\frac{\lambda}{\mu})\,(1-\lambda\beta)$

From #/-:

  to -/#:  $\lambda\beta$ …

Emit functions:

e(#/#) = π(N1) f(N1,N2)

e(#/-) = π(N1), e(-/#) = π(N2)

π(N1) - equilibrium prob. of N1

f(N1,N2) - prob. that N1 evolves into N2
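The alignment HMM can be summed over with a forward algorithm. The sketch below is illustrative rather than the authors' implementation. The rows for */*, #/# and -/# follow the transition probabilities on the slide; the row for #/- is an assumption, filled in with the usual TKF91 value 1 − μβ/(1 − e^{−μt}) for the first insertion after a deletion. The substitution model `f`, the frequencies, and the rates are also assumptions.

```python
import math

STATES = "SMDI"  # S = start (*/*), M = #/#, D = #/-, I = -/#


def tkf91_hmm(lam, mu, t):
    """Transition probabilities, keyed [from_state][to_state]; E is the end."""
    survive = math.exp(-mu * t)
    e = math.exp((lam - mu) * t)
    beta = (1 - e) / (mu - lam * e)
    ratio = lam / mu

    def row(p_insert):
        rest = 1 - p_insert
        return {"I": p_insert,
                "M": rest * ratio * survive,
                "D": rest * ratio * (1 - survive),
                "E": rest * (1 - ratio)}

    common = row(lam * beta)
    # Assumption: standard TKF91 row after a deletion.
    after_deletion = row(1 - mu * beta / (1 - survive))
    return {"S": common, "M": common, "I": common, "D": after_deletion}


def joint_probability(s1, s2, lam, mu, t, pi, f):
    """Forward algorithm: P(s1, s2) summed over all alignments."""
    trans = tkf91_hmm(lam, mu, t)
    n, m = len(s1), len(s2)
    fwd = {s: [[0.0] * (m + 1) for _ in range(n + 1)] for s in STATES}
    fwd["S"][0][0] = 1.0

    def arrive(state, i, j):
        """Probability of reaching `state` from any state at cell (i, j)."""
        return sum(fwd[s][i][j] * trans[s][state] for s in STATES)

    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                a, b = s1[i - 1], s2[j - 1]
                fwd["M"][i][j] = pi[a] * f(a, b) * arrive("M", i - 1, j - 1)
            if i > 0:
                fwd["D"][i][j] = pi[s1[i - 1]] * arrive("D", i - 1, j)
            if j > 0:
                fwd["I"][i][j] = pi[s2[j - 1]] * arrive("I", i, j - 1)
    return arrive("E", n, m)


# Illustrative parameters (assumptions).
pi = {"A": 0.1, "C": 0.2, "G": 0.3, "T": 0.4}
lam, mu, t, s = 0.03, 0.04, 1.0, 0.9
keep = math.exp(-s * t)


def f(a, b):
    """Assumed substitution model: unchanged with prob. keep, else redrawn from pi."""
    return keep * (a == b) + (1 - keep) * pi[b]


p_forward = joint_probability("GTCAT", "GTTGGT", lam, mu, t, pi, f)
p_backward = joint_probability("GTTGGT", "GTCAT", lam, mu, t, pi, f)
```

Swapping the two sequences should leave the result unchanged when `f` is reversible, since the underlying process is time reversible.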

slide11

Acceleration of Pairwise Algorithm

(From Hein, Wiuf, Knudsen, Møller & Wibling 2000)

Speed-up factors:

  • Corner Cutting ~100-1000
  • Better Numerical Search ~10-100 (Ex.: good start guess, 28 evaluations, 3 iterations)
  • Simpler Recursion ~3-10
  • Faster Computers ~250
  • 1991-->2000 ~10^6

slide12

α-globin (141) and β-globin (146)

(From Hein, Wiuf, Knudsen, Møller & Wibling 2000)

430.108 : -log(α-globin)
327.320 : -log(α-globin --> β-globin)
747.428 : -log(α-globin, β-globin) = -log(l(sumalign))

λ*t: 0.0371805 +/- 0.0135899
μ*t: 0.0374396 +/- 0.0136846
s*t: 0.91701 +/- 0.119556

E(Length)   E(Insertions,Deletions)   E(Substitutions)
143.499     5.37255                   131.59

Maximum contributing alignment:

V-LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS--H---GSAQVKGHGKKVADALT
VHLTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFS

NAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
DGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH

Ratio l(maxalign)/l(sumalign) = 0.00565064

slide13

The invasion of the immortal link

VLSPADNAL.....DLHAHKR 141 AA long

*########### …. ### 141 AA long

2 × 10^8 years

2 × 10^7 years

2 × 10^9 years

*########### …. ###

*########### …. ###

???????????????????? k AA long

10^9 years

slide14

Long Insertion-Deletions

can model overlapping indels

more involved dynamic programming:

slide15

Homology test.

(From Hein, Wiuf, Knudsen, Møller & Wibling 2000)

D(s1,s2) is evaluated in D(s1,s2*)

a-, myoglobin homology tests

Random s1 = ATWYFC-AKAC

s2* = LTAYKADCWLE

*

Real s1 = ATWYFCAK-AC

s2 = ETWYKCALLAD

*** ** *

$W_{i,j} = -\ln\big(\pi_i\,P^{2.5}_{i,j} / (\pi_i\,\pi_j)\big)$

1. Tests the competing hypotheses that 2 sequences are 2.5 events apart versus infinitely far apart.

2. It only handles substitutions “correctly”. The rationale for indel costs is more arbitrary.

slide16

Goodness-of-fit of TKF91

Sample random alignments from real sequences

Sample random alignments from random sequences

cgtgttacatatatatagccgatagccg

Compare real and random distribution using Chi-square statistic.

slide17

TKF92 - Unbreakable fragments

  • Fragments evolve into fragments.
  • All possible tilings of the sequences with geometric length fragments are considered.
slide18

Algorithm for alignment on star tree (O(length^6)) (Steel & Hein, 2001)

*ACGC

*TT GT

s2

s1

a

s3

*ACG GT

*######

* (l/m)

slide19

Binary Tree Problem

a1a2

* *

# #

# -

- #

# #

- #

TGA

ACCT

s1

s3

a1

a2

s2

s4

GTT

ACG

The problem would be simpler if:

i. The ancestral sequences & their alignment were known.

ii. The alignment of ancestral alignment columns to leaf sequences was known.

How to sum over all possible ancestral sequences and their alignments?

A Markov chain generating ancestral alignments can solve the problem!

slide20

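Generating Ancestral Alignments

The ancestral sequences a1 and a2 are aligned by the same chain; states are alignment columns (a1 over a2): -/#, #/#, #/-, with start */* and end E.

Transition probabilities from */*, #/# and -/#:

  to -/#:  $\lambda\beta$
  to #/#:  $\frac{\lambda}{\mu}\,(1-\lambda\beta)\,e^{-\mu t}$
  to #/-:  $\frac{\lambda}{\mu}\,(1-\lambda\beta)\,(1-e^{-\mu t})$
  to E:    $(1-\frac{\lambda}{\mu})\,(1-\lambda\beta)$

From #/-:

  to -/#:  $\lambda\beta$ …

[Figure: example path for a1 and a2: */* to #/# with probability $\frac{\lambda}{\mu}(1-\lambda\beta)e^{-\mu t}$, to -/# with probability $\lambda\beta$, to E with probability $(1-\frac{\lambda}{\mu})(1-\lambda\beta)$.]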

slide21

The Basic Recursion

”Remove 1st step” - recursion:

S

E

”Remove last step” - recursion:

Last/First step removal are inequivalent, but have the same complexities. First step algorithm is the simplest.

slide22

#

#

#

#

-

#

=

Where P’(kS i,H) =

F(kSi,H)

Sequence Recursion: First Step Removal

Pa(Sk): probability of the epifixes (suffixes S[k+1:l]), given that the Markov chain starts in state a.

slide23

Contrasting Probability versus Distance Recursions

Probability:

Distance (Sankoff, 1973):

A

#

#

#

#

-

#

C

=

=

+

-

A

15 cases

slide24

Maximum likelihood phylogeny and alignment

Gerton Lunter

Istvan Miklos

Alexei Drummond

Yun Song

Human alpha hemoglobin;Human beta hemoglobin;

Human myoglobin

Bean leghemoglobin

Probability of data: $e^{-1560.138}$

Probability of data and alignment: $e^{-1593.223}$

Probability of alignment given data: $4.279 \times 10^{-15} = e^{-33.085}$

Ratio of insertion-deletions to substitutions: 0.0334
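The third figure follows from the first two by P(alignment | data) = P(data, alignment) / P(data). The arithmetic is best done on the log scale, as in this small sketch (variable names are illustrative):

```python
import math

log_p_data = -1560.138                 # log P(data), from the slide
log_p_data_and_alignment = -1593.223   # log P(data, alignment), from the slide

# Division of probabilities is subtraction of logs.
log_p_alignment_given_data = log_p_data_and_alignment - log_p_data
p_alignment_given_data = math.exp(log_p_alignment_given_data)
```

Exponentiating either of the two inputs directly would underflow to zero in double precision, which is why only the difference is exponentiated.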

slide25

Gibbs Samplers for Statistical Alignment

Holmes & Bruno (2001):

Sampling Ancestors to pairs.

Jensen & Hein (in press):

Sampling nodes adjacent to triples

Slower basic operation, faster mixing

slide26

Metropolis-Hastings Statistical Alignment.

Lunter, Drummond, Miklos, Jensen & Hein, 2005

The phylogeny moves:

As in Drummond et al. 2002

The alignment moves:

We choose a random window in the current alignment (here the first 10 columns)

QST--QCC-S TNQHVSCTGN ALITL---GG
S------CCS GN-HVSCTGK ALLTLTTLGG
---QST--QC TNQH-SCTLN ---TLTSLGA
---QST--QC TNQHVSCTLN ALLGLTSLGA

Then delete all gaps so we get back subsequences

QSTQCCS    TNQHVSCTGN ALITL---GG
SCCS       GN-HVSCTGK ALLTLTTLGG
QSTQC      TNQH-SCTLN ---TLTSLGA
QSTQC      TNQHVSCTLN ALLGLTSLGA

Stochastically realign this part

QSTQCCS    TNQHVSCTGN ALITL---GG
-S--CCS    GN-HVSCTGK ALLTLTTLGG
QSTQC--    TNQH-SCTLN ---TLTSLGA
QSTQC--    TNQHVSCTLN ALLGLTSLGA
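The bookkeeping of the window move can be sketched as follows. This is only an illustration of the first two steps and of splicing a proposal back in: the stochastic realignment itself and the Metropolis-Hastings acceptance ratio are not implemented, the realigned block is simply the one shown on the slide, and all function names are hypothetical.

```python
import random


def random_window(n_columns, width, rng):
    """Pick the start column of a window of the given width uniformly at random."""
    return rng.randrange(n_columns - width + 1)


def degap_window(alignment, start, width):
    """Delete all gaps inside columns [start, start + width) of every row."""
    return [row[start:start + width].replace("-", "") for row in alignment]


def splice_window(alignment, start, width, new_block):
    """Replace the window by a realigned block (rows may change length together)."""
    return [row[:start] + block + row[start + width:]
            for row, block in zip(alignment, new_block)]


alignment = [
    "QST--QCC-STNQHVSCTGNALITL---GG",
    "S------CCSGN-HVSCTGKALLTLTTLGG",
    "---QST--QCTNQH-SCTLN---TLTSLGA",
    "---QST--QCTNQHVSCTLNALLGLTSLGA",
]

rng = random.Random(0)
some_start = random_window(len(alignment[0]), 10, rng)  # a random window

# The slide's example uses the first 10 columns.
subsequences = degap_window(alignment, 0, 10)

# Realigned block copied from the slide; a sampler would draw it stochastically.
realigned = ["QSTQCCS", "-S--CCS", "QSTQC--", "QSTQC--"]
proposal = splice_window(alignment, 0, 10, realigned)
```

A useful invariant for any such move is that removing the gaps from the proposal gives back the original sequences.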

slide27

Metropolis-Hastings Statistical Alignment

Lunter, Drummond, Miklos, Jensen & Hein, 2005

slide28

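The Basics of Footprinting I

  • Many aligned sequences related by a known phylogeny:

[Figure: an alignment with sequences 1…k as rows and positions 1…n as columns; an HMM along the positions switches between slow (rate rs) and fast (rate rf) evolution.]

  • Two un-aligned sequences:

[Figure: an HMM over the nucleotides A, C, G, T aligning two sequences, e.g.
ATG
A-C]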

slide29

The Basics of Footprinting II

Many un-aligned sequences related by a known phylogeny:

  • Conceptually simple, computationally hard
  • Dependent on a single alignment/no measure of uncertainty
  • Statistical Alignment
  • Explicit stochastic model of substitution and indel evolution
  • Advantages: Summing over uncertainty + confidence on inference

[Figure: HMM over alignment columns (# and -), illustrated with the sequences A C and A T G.]

slide30

Statistical Alignment and Footprinting.

[Figure: sequences 1…k (e.g. acgtttgaaccgag----) are aligned by the Alignment HMM, and the alignment is annotated by the Signal HMM (Ex.: nnnnnnnnnnn).]

Comment: The A-HMM * S-HMM is an approximate approach, as the S-HMM does not include an evolutionary model.

slide31

Structure HMM

[Figure: Structure HMM with the states F and S and transition probabilities 0.9 and 0.1; state pairs SF, FS, SS, FF; an alignment with its annotation (A,S), e.g. F F S S F, produced by the Alignment HMM followed by the Structure HMM.]

“Structure” does not stem from an evolutionary model

  • The equilibrium annotation does not follow a Markov Chain.
  • Each alignment from the Alignment HMM is annotated by the Structure HMM.
  • No ideal way of simulating:
    • using the HMM at the alignment will give other distributions on the leaves
    • using the HMM at the root will give other distributions on the leaves

slide32

Structure Description

  • Simple Promoter/Enhancer Structure: only Fast/Slow

Start

M2

=

M2

M3

M1

Stop

Alignment HMM

Structure HMM

  • Advanced Promoter/Enhancer Structure: General HMM (J. Liu)
  • De novo cis-regulatory module elicitation for eukaryotic genomes. Proc Nat’l Acad Sci USA, 102, 7079-84
  • For instance, different nature of indel process
  • The substitution process
  • Other possibilities:
  • Gene Structure/RNA Structure
slide33

An example

Previously-identified binding sites indicated by colored boxes

  • Predicted functional elements in RED BOLD TEXT
  • In overall region, program correctly identified 8 out of 11 binding sites with 4 false positives
    • Overlapping binding sites may indicate repressor relationships
    • False positives show lesser degree of conservation, could be undetected binding sites
slide34

An example further

  • 8 out of 11 binding sites correctly identified, total of 4 false identifications, one of which lay adjacent to the true binding site.
  • Issues with the highly analyzed regions as gold standards - probably only find very strong regulatory regions.
slide35

References Statistical Alignment

  • Fleissner R, Metzler D, von Haeseler A. Simultaneous statistical multiple alignment and phylogeny reconstruction. Syst Biol. 2005 Aug;54(4):548-61.
  • Hein, J., C. Wiuf, B. Knudsen, M. Møller, and G. Wibling (2000): Statistical Alignment: Computational Properties, Homology Testing and Goodness-of-Fit. (J. Molecular Biology 302:265-279)
  • Hein, J.J. (2001): A generalisation of the Thorne-Kishino-Felsenstein model of Statistical Alignment to k sequences related by a binary tree. (Pac. Symp. Biocomput. 2001, p. 179-190, eds R.B. Altman et al.)
  • Steel, M. & J.J. Hein (2001): A generalisation of the Thorne-Kishino-Felsenstein model of Statistical Alignment to k sequences related by a star tree. (Letters in Applied Mathematics)
  • Hein JJ, J.L. Jensen, C. Pedersen (2002) Algorithms for Multiple Statistical Alignment. (PNAS) 2003 Dec 9;100(25):14960-5.
  • Holmes, I. (2003) Using Guide Trees to Construct Multiple-Sequence Evolutionary HMMs. Bioinformatics, special issue for ISMB2003, 19:147i–157i.
  • Jensen, J.L. & Hein, J. (2004) A Gibbs sampler for statistical multiple alignment. Statistica Sinica, in press.
  • Miklós, I., Lunter, G.A. & Holmes, I. (2004) A 'long indel' model for evolutionary sequence alignment. Mol. Biol. Evol. 21(3):529–540.
  • Lunter, G.A., Miklós, I., Drummond, A.J., Jensen, J.L. & Hein, J. (2005) Bayesian Coestimation of Phylogeny and Sequence Alignment. BMC Bioinformatics, 6:83.
  • Lunter, G.A., Miklós, I., Drummond, A., Jensen, J.L. & Hein, J. (2003) Bayesian phylogenetic inference under a statistical indel model. Lecture Notes in Bioinformatics, Proceedings of WABI'03, 2812:228–244.
  • Lunter, G.A., Miklós, I., Song, Y.S. & Hein, J. (2003) An efficient algorithm for statistical multiple alignment on arbitrary phylogenetic trees. J. Comp. Biol., 10(6):869–88
  • Miklos, Lunter & Holmes (2002) (submitted ISMB)
  • Miklos, I. & Toroczkai, Z. (2001) An improved model for statistical alignment, in WABI2001, Lecture Notes in Computer Science, (O. Gascuel & BME Moret, eds) 2149:1-10. Springer, Berlin.
  • Metzler D. Statistical alignment based on fragment insertion and deletion models. Bioinformatics. 2003 Mar 1;19(4):490-9.
  • Miklos, I. (2002) An improved algorithm for statistical alignment of sequences related by a star tree. Bul. Math. Biol. 64:771-779.
  • Miklos, I.: Algorithm for statistical alignment of sequences derived from a Poisson sequence length distribution. Disc. Appl. Math., accepted.
  • Thorne JL, Kishino H, Felsenstein J. Inching toward reality: an improved likelihood model of sequence evolution. J Mol Evol. 1992 Jan;34(1):3-16.
  • Thorne JL, Kishino H, Felsenstein J. An evolutionary model for maximum likelihood alignment of DNA sequences. J Mol Evol. 1991 Aug;33(2):114-24. Erratum in: J Mol Evol 1992 Jan;34(1):91.
  • Thorne JL, Churchill GA. Estimation and reliability of molecular sequence alignments. Biometrics. 1995 Mar;51(1):100-13.

TKF92, Long Indel, Explain HMM, Multiple Recursion, Hidden State Space, 1-state recursion and other reductions, competing algorithms,