Hidden Markov Models

[Figure: HMM trellis with hidden states 1, 2, …, K at each position and emitted letters x1, x2, x3, …, xK]


Variants of HMMs


Higher-order HMMs

  • How do we model “memory” larger than one time point?

  • First order: $P(\pi_{i+1} = l \mid \pi_i = k) = a_{kl}$

  • Second order: $P(\pi_{i+1} = l \mid \pi_i = k, \pi_{i-1} = j) = a_{jkl}$

  • A second-order HMM with K states is equivalent to a first-order HMM with $K^2$ states

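[Figure: a second-order HMM over states H and T, with transitions aHT(prev = H), aHT(prev = T), aTH(prev = H), aTH(prev = T), redrawn as a first-order HMM over pair states HH, HT, TH, TT with transitions aHHT, aHTH, aHTT, aTHH, aTHT, aTTH]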
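The pair-state construction can be sketched in a few lines. This is an illustration only, not part of the slides: the function name `second_to_first_order` and all probability values are hypothetical; only the states H and T and the idea of pair states (previous, current) come from the figure.

```python
from itertools import product

def second_to_first_order(states, a2):
    """Expand second-order transitions a2[(j, k, l)] = P(next=l | cur=k, prev=j)
    into first-order transitions over pair states (prev, cur)."""
    pairs = list(product(states, repeat=2))
    a1 = {}
    for (j, k) in pairs:
        for (k2, l) in pairs:
            # Pair state (j, k) can only reach pair states whose first element is k.
            a1[((j, k), (k2, l))] = a2[(j, k, l)] if k2 == k else 0.0
    return pairs, a1

states = ["H", "T"]
# Hypothetical P(next = H | prev, cur); values chosen only for illustration.
p_next_h = {("H", "H"): 0.9, ("H", "T"): 0.3, ("T", "H"): 0.6, ("T", "T"): 0.2}
a2 = {}
for (j, k), p in p_next_h.items():
    a2[(j, k, "H")] = p
    a2[(j, k, "T")] = 1.0 - p

pairs, a1 = second_to_first_order(states, a2)
print(len(pairs))  # K^2 pair states; here K = 2
```

Each row of the expanded matrix still sums to 1, because a pair state (j, k) spreads its probability only over the K pair states that begin with k.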


Similar Algorithms to 1st Order

  • $P(\pi_{i+1} = l \mid \pi_i = k, \pi_{i-1} = j)$

    • $V_{lk}(i) = \max_j \{ V_{kj}(i-1) + \dots \}$

    • Time? Space?


Modeling the Duration of States

[Figure: state X with self-transition p and transition 1-p to state Y; state Y with self-transition q and transition 1-q]

Length distribution of region X:

$E[l_X] = 1/(1-p)$

  • Geometric distribution, with mean 1/(1-p)

    This is a significant disadvantage of HMMs

    Several solutions exist for modeling different length distributions
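A quick numerical check of the mean, assuming the self-transition probability of X is p so that a stay of exactly m letters has probability $p^{m-1}(1-p)$. The function names and the value p = 0.9 are chosen for illustration only.

```python
def geometric_pmf(m, p):
    """P(l_X = m): stay in X for m - 1 self-transitions, then leave."""
    return p ** (m - 1) * (1 - p)

def mean_length(p, max_len=10_000):
    """Mean duration, truncating the infinite sum at max_len."""
    return sum(m * geometric_pmf(m, p) for m in range(1, max_len + 1))

p = 0.9  # illustrative value
print(mean_length(p), 1 / (1 - p))
```

The pmf is largest at m = 1 and decays from there, which is why a single state fits poorly any feature whose typical length is far from 1.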


Example: exon lengths in genes


Solution 1: Chain several states

[Figure: three copies of state X chained in sequence (labels p, 1-p), leading to state Y (labels q, 1-q)]

Disadvantage: Still very inflexible

$l_X = C + {}$ geometric with mean $1/(1-p)$


Solution 2: Negative binomial distribution

[Figure: chain of states X(1), X(2), …, X(n), each with self-transition p and transition 1 – p to the next state; X(n) leads to Y]

Duration in X: m turns, where

  • During first m – 1 turns, exactly n – 1 arrows to next state are followed

  • During mth turn, an arrow to next state is followed

    $P(l_X = m) = \binom{m-1}{n-1} (1-p)^{n-1+1} p^{(m-1)-(n-1)} = \binom{m-1}{n-1} (1-p)^n p^{m-n}$
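The formula can be checked numerically. This is a sketch with an illustrative function name and illustrative parameter values (n = 3, p = 0.9); it confirms that the probabilities sum to 1 and that the mean duration is n/(1 – p), i.e. n times the mean of a single geometric state.

```python
from math import comb

def neg_binomial_pmf(m, n, p):
    """P(l_X = m) for a chain of n copies of X, each with self-transition p."""
    if m < n:
        return 0.0  # at least one letter is emitted in each of the n copies
    return comb(m - 1, n - 1) * (1 - p) ** n * p ** (m - n)

n, p = 3, 0.9  # illustrative values
total = sum(neg_binomial_pmf(m, n, p) for m in range(n, 3000))
mean = sum(m * neg_binomial_pmf(m, n, p) for m in range(n, 3000))
print(total, mean, n / (1 - p))
```

With n = 1 the expression reduces to the geometric distribution of a single state.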


Example: genes in prokaryotes

  • EasyGene: prokaryotic gene-finder (Larsen TS, Krogh A)

  • Negative binomial with n = 3


Solution 3: Duration modeling

Upon entering a state:

  • Choose duration d, according to probability distribution

  • Generate d letters according to emission probs

  • Take a transition to next state according to transition probs

    Disadvantage: Increase in complexity of Viterbi:

    Time: increases by a factor of O(D)

    Space: increases by a factor of O(1)

    where D = maximum duration of state

[Figure: state F with duration distribution Pf, d < Df, emitting xi…xi+d-1]

Warning: Rabiner’s tutorial claims O(D^2) and O(D) increases


Viterbi with duration modeling

Recall original iteration:

$V_l(i) = \max_k V_k(i-1)\, a_{kl}\, e_l(x_i)$

New iteration:

$V_l(i) = \max_k \max_{d = 1 \dots D_l} V_k(i-d)\, P_l(d)\, a_{kl} \prod_{j=i-d+1}^{i} e_l(x_j)$

[Figure: states F and L joined by transitions, with duration distributions Pf (d < Df) and Pl (d < Dl); F emits xi…xi+d-1 and L emits xj…xj+d-1]

Precompute cumulative values


Proteins, Pair HMMs, and Alignment


A state model for alignment

[Figure: three states: M (+1,+1), I (+1, 0), J (0, +1)]

Alignments correspond 1-to-1 with sequences of states M, I, J

-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC-GGTCGATTTGCCCGACC
IMMJMMMMMMMJJMMMMMMJMMMMMMMIIMMMMMIII


Let’s score the transitions

[Figure: the three states M (+1,+1), I (+1, 0), J (0, +1); every transition into M scores s(xi, yj), transitions from M into I or J score -d, and the self-transitions of I and J score -e]


Alignment with affine gaps – state version

Dynamic Programming:

M(i, j): Optimal alignment of x1…xi to y1…yj ending in M

I(i, j): Optimal alignment of x1…xi to y1…yj ending in I

J(i, j): Optimal alignment of x1…xi to y1…yj ending in J

The score is additive, therefore we can apply DP recurrence formulas


Alignment with affine gaps – state version

Initialization:

$M(0, 0) = 0$;

$M(i, 0) = M(0, j) = -\infty$, for $i, j > 0$

$I(i, 0) = d + ie$; $J(0, j) = d + je$

Iteration:

$M(i, j) = s(x_i, y_j) + \max \{ M(i-1, j-1),\ I(i-1, j-1),\ J(i-1, j-1) \}$

$I(i, j) = \max \{ e + I(i-1, j),\ d + M(i-1, j) \}$

$J(i, j) = \max \{ e + J(i, j-1),\ d + M(i, j-1) \}$

Termination:

Optimal alignment given by $\max \{ M(m, n), I(m, n), J(m, n) \}$

Note: d and e are added here, so on this slide they denote the (negative) scores of opening and extending a gap; the state diagram labels the same transitions -d and -e.
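The three-matrix recurrence can be sketched as follows. Assumptions that are not in the slides: d and e are passed as positive penalties and subtracted (the -d, -e convention of the state diagram), the boundary values of I and J follow from the recurrence rather than from the initialization formula above, and the function names and the +1/-1 scoring are purely illustrative. Only the optimal score is returned, not the alignment.

```python
NEG_INF = float("-inf")

def affine_align_score(x, y, s, d, e):
    """Optimal global alignment score with gap-open penalty d and gap-extend penalty e."""
    m, n = len(x), len(y)
    M = [[NEG_INF] * (n + 1) for _ in range(m + 1)]  # ends with x_i aligned to y_j
    I = [[NEG_INF] * (n + 1) for _ in range(m + 1)]  # ends with x_i aligned to a gap
    J = [[NEG_INF] * (n + 1) for _ in range(m + 1)]  # ends with y_j aligned to a gap
    M[0][0] = 0.0
    for i in range(m + 1):
        for j in range(n + 1):
            if i > 0 and j > 0:
                M[i][j] = s(x[i - 1], y[j - 1]) + max(
                    M[i - 1][j - 1], I[i - 1][j - 1], J[i - 1][j - 1])
            if i > 0:
                I[i][j] = max(M[i - 1][j] - d, I[i - 1][j] - e)
            if j > 0:
                J[i][j] = max(M[i][j - 1] - d, J[i][j - 1] - e)
    return max(M[m][n], I[m][n], J[m][n])

def match_score(a, b):
    """Hypothetical substitution score: +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1

print(affine_align_score("ACGT", "AT", match_score, d=2, e=1))
```

In the example, the best alignment pairs A with A and T with T around a single gap of length 2, scoring 1 - 2 - 1 + 1 = -1. As in the recurrence, a gap in one sequence cannot be followed directly by a gap in the other.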


Brief introduction to the evolution of proteins

Protein sequence and structure

Protein classification

Phylogeny trees

Substitution matrices


Muscle cells and contraction


Actin and myosin during muscle movement


Actin structure


Actin sequence

  • Actin is ancient and abundant

    • Most abundant protein in cells

    • 1-2 actin genes in bacteria, yeasts, amoebas

    • Humans: 6 actin genes

      • α-actin in muscles; β-actin, γ-actin in non-muscle cells

      • ~4 amino acids different between each version

        MUSCLE ACTIN Amino Acid Sequence

        1 EEEQTALVCD NGSGLVKAGF AGDDAPRAVF PSIVRPRHQG VMVGMGQKDS YVGDEAQSKR

        61 GILTLKYPIE HGIITNWDDM EKIWHHTFYN ELRVAPEEHP VLLTEAPLNP KANREKMTQI

        121 MFETFNVPAM YVAIQAVLSL YASGRTTGIV LDSGDGVSHN VPIYEGYALP HAIMRLDLAG

        181 RDLTDYLMKI LTERGYSFVT TAEREIVRDI KEKLCYVALD FEQEMATAAS SSSLEKSYEL

        241 PDGQVITIGN ERFRGPETMF QPSFIGMESS GVHETTYNSI MKCDIDIRKD LYANNVLSGG

        301 TTMYPGIADR MQKEITALAP STMKIKIIAP PERKYSVWIG GSILASLSTF QQMWITKQEY

        361 DESGPSIVHR KCF


A related protein in bacteria


Relation between sequence and structure


Protein Phylogenies

  • Proteins evolve by both duplication and species divergence


Protein Phylogenies – Example


Structure Determines Function

The Protein Folding Problem

  • What determines structure?

    • Energy

    • Kinematics

  • How can we determine structure?

    • Experimental methods

    • Computational predictions


Primary Structure: Sequence

  • The primary structure of a protein is the amino acid sequence


Primary Structure: Sequence

  • Twenty different amino acids have distinct shapes and properties


Primary Structure: Sequence

A useful mnemonic for the hydrophobic amino acids is "FAMILY VW"


Secondary Structure: α, β, & loops

  • α helices and β sheets are stabilized by hydrogen bonds between backbone oxygen and hydrogen atoms


Tertiary Structure: A Protein Fold


PDB Growth

[Figure: new PDB structures]


Only a few folds are found in nature


Protein classification

  • Number of protein sequences grows exponentially

  • Number of solved structures grows exponentially

  • Number of new folds identified very small (and close to constant)

  • Protein classification can

    • Generate overview of structure types

    • Detect similarities (evolutionary relationships) between protein sequences

    • Help predict 3D structure of new protein sequences

Classification of 27,599 protein structures in PDB


Protein structure classification

[Figure: nested levels of classification: protein world, protein fold, protein superfamily, protein family]

Morten Nielsen, CBS, BioCentrum, DTU


Structure Classification Databases

  • SCOP

    • Manual classification (A. Murzin)

    • scop.berkeley.edu

  • CATH

    • Semi manual classification (C. Orengo)

    • www.biochem.ucl.ac.uk/bsm/cath

  • FSSP

    • Automatic classification (L. Holm)

    • www.ebi.ac.uk/dali/fssp/fssp.html


Major classes in SCOP

  • Classes

    • All α proteins

    • All β proteins

    • α and β proteins (α/β)

    • α and β proteins (α+β)

    • Multi-domain proteins

    • Membrane and cell surface proteins

    • Small proteins

    • Coiled coil proteins


All α: Hemoglobin (1bab)


All β: Immunoglobulin (8fab)


α/β: Triosephosphate isomerase (1hti)


α+β: Lysozyme (1jsf)


Families

  • Proteins whose evolutionary relationship is readily recognizable from the sequence (>~25% sequence identity)

  • Families are further subdivided into Proteins

  • Proteins are divided into Species

    • The same protein may be found in several species

[Figure: classification hierarchy: Fold, Superfamily, Family, Proteins]


Superfamilies

  • Proteins which are (remotely) evolutionarily related

    • Sequence similarity low

    • Share function

    • Share special structural features

  • Relationships between members of a superfamily may not be readily recognizable from the sequence alone



Folds

  • >~50% secondary structure elements arranged in the same order in sequence and in 3D

  • No evolutionary relation



Substitutions of Amino Acids

Mutation rates between amino acids have dramatic differences!


Substitution Matrices

BLOSUM matrices:

  • Start from BLOCKS database (curated, gap-free alignments)

  • Cluster sequences according to > X% identity

  • Calculate $A_{ab}$: # of aligned a-b in distinct clusters, correcting by 1/mn, where m, n are the two cluster sizes

  • Estimate

    $P(a) = (\sum_b A_{ab}) / (\sum_{c \le d} A_{cd})$; $P(a, b) = A_{ab} / (\sum_{c \le d} A_{cd})$


Probabilistic interpretation of an alignment

An alignment is a hypothesis that the two sequences are related by evolution

Goal:

Produce the most likely alignment

Assert the likelihood that the sequences are indeed related


A Pair HMM for alignments

Model M

This model generates two sequences simultaneously

Match/Mismatch state M:

P(x, y) reflects substitution frequencies between pairs of amino acids

Insertion states I, J:

P(x), P(y) reflect frequencies of each amino acid

δ: set so that 1/(2δ) is avg. length before next gap

ε: set so that 1/(1 – ε) is avg. length of a gap

[Figure: state M emits P(xi, yj) and has self-transition 1 – 2δ; M moves to I or to J with probability δ each; I emits P(xi) and J emits P(yj), each with self-transition ε and transition 1 – ε back to M; some transitions are marked optional]


A Pair HMM for unaligned sequences

Model R

The two sequences are generated independently of one another

$P(x, y \mid R) = P(x_1) \cdots P(x_m)\, P(y_1) \cdots P(y_n) = \prod_i P(x_i) \prod_j P(y_j)$

[Figure: state I emitting P(xi) and state J emitting P(yj), each with transition probability 1]
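Under Model R the likelihood is a plain product of background frequencies, which is easiest to handle as a sum of logs. A minimal sketch, with a hypothetical function name and a uniform DNA background chosen only for illustration:

```python
import math

def log_prob_random(x, y, freq):
    """log P(x, y | R): every letter of both sequences is drawn independently."""
    return sum(math.log(freq[c]) for c in x) + sum(math.log(freq[c]) for c in y)

# Illustrative background frequencies (uniform over the DNA alphabet).
freq = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
print(log_prob_random("ACGT", "AC", freq))  # six letters, each contributing log 0.25
```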


To compare ALIGNMENT vs. RANDOM hypothesis

Every pair of letters contributes:

M

  • (1 – 2δ) P(xi, yj) when matched

  • ε P(xi) P(yj) when gapped

R

  • P(xi) P(yj) in random model

Focus on comparison of

P(xi, yj) vs. P(xi) P(yj)

[Figure: Model M (transition labels 1 – 2δ and 1 – ε) shown beside Model R (transition labels 1)]


To compare ALIGNMENT vs. RANDOM hypothesis

[Figure: Model M redrawn with transition labels 1 – 2δ and δ(1 – ε)/(1 – 2δ), marked “Equivalent!”, shown beside Model R]


To compare ALIGNMENT vs. RANDOM hypothesis

Idea:

We will divide alignment score by the random score, and take logarithms

Let

$s(x_i, y_j) = \log \frac{P(x_i, y_j)}{P(x_i)\, P(y_j)} + \log(1 - 2\delta)$ (substitution score)

$d = -\log \frac{\delta (1 - \epsilon)\, P(x_i)}{(1 - 2\delta)\, P(x_i)}$ (gap initiation penalty)

$e = -\log \frac{\epsilon\, P(x_i)}{P(x_i)}$ (gap extension penalty)
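These definitions translate directly into code once the P(xi) factors cancel in d and e. The sketch below is illustrative: the function name, the values δ = 0.05 and ε = 0.5, and the toy DNA probabilities are assumptions, not values from the slides.

```python
import math

def alignment_scores(p_joint, p_single, delta, eps):
    """Return (s, d, e): substitution scores, gap-open and gap-extend penalties."""
    s = {
        (a, b): math.log(p_ab / (p_single[a] * p_single[b])) + math.log(1 - 2 * delta)
        for (a, b), p_ab in p_joint.items()
    }
    d = -math.log(delta * (1 - eps) / (1 - 2 * delta))  # P(x_i) cancels
    e = -math.log(eps)                                   # P(x_i) cancels
    return s, d, e

# Toy DNA model (hypothetical): 80% of aligned pairs are identical.
letters = "acgt"
p_single = {a: 0.25 for a in letters}
p_joint = {(a, b): (0.8 / 4 if a == b else 0.2 / 12) for a in letters for b in letters}
s, d, e = alignment_scores(p_joint, p_single, delta=0.05, eps=0.5)
print(round(s[("a", "a")], 3), round(d, 3), round(e, 3))
```

With these values opening a gap costs more than extending one (d > e), which is the usual affine-gap regime.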


The meaning of alignment scores

  • The Viterbi algorithm for Pair HMMs corresponds exactly to global alignment DP with affine gaps

    $V_M(i, j) = \max \{ V_M(i-1, j-1),\ V_I(i-1, j-1),\ V_J(i-1, j-1) \} + s(x_i, y_j)$

    $V_I(i, j) = \max \{ V_M(i-1, j) - d,\ V_I(i-1, j) - e \}$

    $V_J(i, j) = \max \{ V_M(i, j-1) - d,\ V_J(i, j-1) - e \}$

    • s(·,·), together with the log(1 – 2δ) term: how often a pair of letters substitute for one another

    • e = –log ε: governed by the mean length of a gap, 1/(1 – ε)

    • d = –log [δ(1 – ε) / (1 – 2δ)]: governed by the mean arrival time of the next gap, 1/(2δ)


The meaning of alignment scores

Match/mismatch scores:

$s(a, b) \approx \log \frac{P(a, b)}{P(a)\, P(b)}$ (ignore $\log(1 - 2\delta)$ for the moment; logs are natural)

Example:

DNA regions between human and mouse genes have average conservation of 80%

  • What is the substitution score for a match?

    $P(a, a) + P(c, c) + P(g, g) + P(t, t) = 0.8 \Rightarrow P(x, x) = 0.2$

    $P(a) = P(c) = P(g) = P(t) = 0.25$

    $s(x, x) = \log [ 0.2 / 0.25^2 ] = 1.163$

  • What is the substitution score for a mismatch?

    $P(a, c) + \dots + P(t, g) = 0.2 \Rightarrow P(x, y \ne x) = 0.2/12 = 0.0167$

    $s(x, y \ne x) = \log [ 0.0167 / 0.25^2 ] = -1.322$

  • What ratio matches/(matches + mism.) gives score 0?

    x(#match) – y(#mism) = 0

    1.163 (#match) – 1.322 (#mism) = 0

    #match = 1.137 (#mism)

    matches = 53.2%
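The arithmetic of this example can be reproduced directly, using natural logarithms as the slide's numbers imply. Variable names are illustrative.

```python
import math

p = 0.25                  # background frequency of each nucleotide
p_match = 0.8 / 4         # P(x, x): 80% identity spread over 4 identical pairs
p_mismatch = 0.2 / 12     # P(x, y != x): 20% spread over 12 ordered mismatch pairs

s_match = math.log(p_match / p ** 2)
s_mismatch = math.log(p_mismatch / p ** 2)

# Score 0 when s_match * #match + s_mismatch * #mism = 0.
matches_per_mismatch = -s_mismatch / s_match
match_fraction = matches_per_mismatch / (1 + matches_per_mismatch)

print(round(s_match, 3), round(s_mismatch, 3), round(match_fraction, 3))
```

Regions with a match fraction below this break-even point score negatively on average, so the scoring scheme separates 80%-conserved regions from unrelated sequence only above roughly 53% identity.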


BLOSUM matrices

BLOSUM 50

BLOSUM 62

(The two are scaled differently)

