Download Presentation
Markov Models and HMM in Genome Analysis

Loading in 2 Seconds...

1 / 23

# Markov Models and HMM in Genome Analysis - PowerPoint PPT Presentation

Journée Statistique et Santé Paris 5 – 5 mai 2006. Markov Models and HMM in Genome Analysis. Bernard PRUM. La genopole – Evry – France prum@genopole.cnrs.fr. Why Markov Models ?. A biological sequence : X = (X 1 , X 2 , … , X n ) where X k   A = { t , c , a , g }

I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
Download Presentation

## PowerPoint Slideshow about 'Markov Models and HMM in Genome Analysis' - thadeus

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript

Journée Statistique et Santé

Paris 5 – 5 mai 2006

Markov Models and HMMin Genome Analysis

Bernard PRUM

La genopole – Evry – France

prum@genopole.cnrs.fr

Why Markov Models ?

A biological sequence :

X = (X1, X2, … , Xn)

where XkA = { t , c , a , g }

or {A C D E F G H I K L M N P Q R S T V W Y}

A very common tool for analyzing these sequences is the

Markov Model (MM)

P(Xk = v | Xj , j < k) = P(Xk = v | Xk – 1) u, v A

denoted by π(u , v) if Xk – 1 = u

A complex, called Rec BCD,

protects the cell against viruses

Why MM ? – 2

Exemple :

Rec BCD

E. coli

viruses

chi

own bacteria genome

To avoid the destruction of the genome of the cell, along the genome exists a passwordgctggtgg (it is called chi). When rec BCD bumps into the chi, it stops its destruction. In order to be efficient the number of occurrences of the chi is much higher that the number predicted in a Markov model.

Why MM ? – 3

The « exceptional character» of a word depends on

• the frequency of each letter

 model M0 (Bernoulli)

Exemple HIV1

t 2164

c 1773

a 3410

g 2370

Why MM ? – 3

The « exceptional character» of a word depends on

• the frequency of each letter

 model M0 (Bernoulli),

• the frequency of 2-words tt , tc ta, tg, … gg

 model M1 (Markov)

Exemple HIV1

t c a g

t 548 342 684 590 2164

c 470 413 795 95 1773

a 713 561 1112 1024 3410

g 432 457 820 661 2370

Hidden Markov Models

An important criticism against Markov modelization is its stationarity:

a well known theorem says that, under weak conditions,

P(Xk = u)  µ(u) (when k  ∞)

(and the rate of convergence is exponential.)

But biological sequences are not homogeneous.

There are g+c rich segments / g+c poor segments (isochores).

One may presume (and verify) that the rules of succession of letters

differ in coding parts / non-coding parts.

Is it possible to take avantage of this problem and to develop a tool for the analysis of heterogeneity ? => annotation

HMM – 2

Suppose that d states alternate along the sequence

Sk = 1 Sk = 2Sk = 1Sk = 2 Sk = 1

And in each state we have a MC :

if Sk = 1, then P(Xk = v | Xk–1 = u) = π1(u ; v)

if Sk = 2, then P(Xk = v | Xk–1 = u) = π2(u ; v)

and (more technical than biological - see HSMM)

P(Sk = y | Sk–1 = x) = π0(u ; v)

• Estimate the parameters π1, π2, π0

• Allocate a state {1, 2} to each position

Our objectives

HMM – 3

¡ Use the likelihood !!

L(q) = ∑ µ0(S1) µS (X1) ....

1

...∏ π0(Sk-1,Sk) πk (Xk-1,Xk)

n terms (length of the sequence)

over all possibilities S1S2...Sn ; there are sn terms

Désespoir !!!

210 000 = 103 000

back to HMM

Step n° 1

We do not know what is S= 1 and what is Sk = 2

We make an arbitrary allocation of these states :

Sk = 1 Sk = 2Sk = 1Sk = 2 Sk = 1

“Knowing” the states, it is obvious to compute the parameters

exple : π1(c,g) = % of the c which are followed by a g in green parts.

π0(•,•) = % of • which are followed by a •

Step n° 2

Baum & Churchill formula

Define

the predictive probability ak(v) = P(Sk = v | X1k-1)

the filtragee probability bk(v) = P(Sk = v | X1k)

the estimated probability jk(v) = P(Sk = v | X1n)

Bayes formula

Forward (k = 2 to n)

ak(v) = ∑ubk-1(u) π0 (u, v)

ak-1(u) πu (Xk-2, Xk-1)

bk-1(u) =

∑w ak-1(w) πw (Xk-2, Xk-1)

Backward (k = n to 2)

jk(v)

jk-1(v) = bk-1(u) ∑v π0 (u, v)

bk(v)

and Sk ?

Bad idea : Sk = arg maxjk(v) (*)

First good idea : keep the distribution jk(v)

[EM]

Second good idea : draw Sk according to jk(v)

[SEM]

(*) except, may be at the end

of the algorithm (freezing)

SHMM

••

(

)

New defect : transition between states is described using a Markov model

1 - p p

q 1 - q

π0 =

Consequence : the length of segments of ‘a given colour’ are r.v. ~ Expo(–)

SHMM

Does not correspond to the reality ! !

Histograms of ‘biological segments‘ (after smoothing) look more like

density g(h)

h

It is easy to make the probability of leaving the state depend on h

to get the suitable law :

g(h)

p(h) =

1 - G(h)

Searching nucleosomes

What are nucleosomes ?

In eukaryotes (only), an important part of the chromosomes forms chromatine, a state where the double helix winds round “beads“ forming a collar :

|

|

|

10 nm

Each bead is called a nucleosome. Its core is a complex involving 8 proteins (an octamer) called histone(H2A, H2B, H3, H4). DNA winds twice this core and is locked by an other histone (H1). The total weight of the histones is ± equal to the weight of the DNA.

Nucleosomes

Curvature within curvature

Back to one nucleosome :

the DNA helix turns twice around the histone core. Each turn corresponds to about 7 pitches of the helix, each one made with about 10 nucleotides.

Total = 146 nt within each nucleosome.

Depending on the position (“in”vs “out”) the curvature satisfies different constraints

Bendability

Following an idea (Baldi, Lavery,...) we introduce an indice of bendability ;

it depends on succession of 2, 3, 4, ...di-nucleotides.

g

a

t

t

g

a

c

t

a

c

t

a

q

PNUC table

2nd letter

a t g c

a 0.0 2.8 3,3 5.2 a

t 7,3 7,3 10,0 10,0 a

g 3,0 6,4 6,2 7,5 a

c 3,3 2,2 8,3 5,4 a

a 0.7 0.7 5,8 5,8 t

t 2.8 0.0 5.2 3,3 t

g 5.3 3.7 5.4 7,5 t

c 6,7 5.2 5.4 5.4 t

1rst letter 3rd letter

a 5.2 6,7 5.4 5.4 g

t 2,2 3,3 5.4 8,3 g

g 5.4 6,5 6,0 7,5 g

c 4,2 4,2 4,7 4,7 g

a 3.7 5.3 7,5 5.4 c

t 6,4 3,0 7,5 6,2 c

g 5.6 5.6 8.2 8.2 c

c 6,5 5.4 7,5 6,0 c

There exist various tables which indicate the bendability of di-, tri or even tetra-nucleotides (PNUC, DNase, ...)

We used PNUC-3 :

PNUC(cga) = 8,3

PNUC(tcg) = 8,3

(*) Goodsell, Dickerson, NAR 22 (1994)

HMM for nucleosomes

Better : consider a different state for ”each” position in the nucleosome (say 146 states)

1 2 3

. . .

145 146

and the repetition of r (say 4) identical states for the between-nucleosome regions (= spacer).

These brother states give to the law of the length of the b.n. regions a Gamma form which is not geometrical ! !

A no-nuc state

Trifonov (99) as well as Rando (05) underline that

there are ‘no‘ nucleosome in the gene promotors (accessibility)

The introduce “before“ nucleosome a “no-nucleosme” state.

1 2

. . .

70

nucleosome core

no-nuc

spacer

Ioshikhes, Trifonov, Zhang Proc. Natl Acad. Sc. 96 (1999)

Yuan, Liu, Dion, Slack, Wu, Altschuler, Rando, Sciencexpress (2005)

Acknowledgements

Labo «Statistique et Génome» Labo MIG – INRA

Maurice BAUDRY Philippe BESSIÈRE

Etienne BIRMELE François RODOLPHE

Cécile COT Sophie SCHBATH

Mark HOEBEKE Élisabeth de TURCKHEIM

Mickael GUEDJ

François KÉPÈS

Sophie LEBRE Labo AGRO

Catherine MATIAS Jean-Noël BACRO

Vincent MIELE Jean-Jacques DAUDIN

Florence MURI-MAJOUBE Stéphane ROBIN

Grégory NUEL

Hugues RICHARD

Anne-Sophie TOCQUET Lab’ RouenDominique CELLIER

Nicolas VERGNE Dominique CELLIER

Sec : Michèle ILBERT Sabine MERCIER