
Hidden Markov Models: What are they good for?

Morten Nielsen

CBS



Objectives

  • Introduce Hidden Markov models and understand that they are just weight matrices with gaps

  • See the beauty of sequence profiles

    • Position specific scoring matrices (PSSMs)

  • Understand what biological problems are best described using HMMs

    • And which are not!


Outline

What is an HMM

What are they good for?

How to construct an HMM

How to “score” a sequence to an HMM

Viterbi decoding

HMMs that made a difference

Profile HMMs

TMHMM

Links to HMM packages



Markov Models

  • A model with no memory

    • What I decide depends only on “state” now, not on what I have learned in the past

    • No dependence on i-1, i-2 …
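In symbols, the memoryless (first-order Markov) property can be written as follows; the notation is mine, not from the slides:

$$P(x_i \mid x_1, x_2, \ldots, x_{i-1}) = P(x_i \mid x_{i-1})$$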


A Markov model?

  • No memory

  • Model generates numbers

    • 312453666641

The unfair casino: the loaded die has p(6) = 0.5; switch fair→loaded: 0.05; switch loaded→fair: 0.1

[State diagram: the Fair state emits 1-6 each with probability 1/6; the Loaded state emits 1-5 with probability 1/10 each and 6 with probability 1/2. Transitions: Fair→Fair 0.95, Fair→Loaded 0.05, Loaded→Loaded 0.90, Loaded→Fair 0.10.]
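A minimal sketch of how such a two-state model generates a string of rolls (Python; the variable names are mine, the probabilities are the ones in the diagram above):

import random

# Unfair-casino parameters from the slide
STAY = {"F": 0.95, "L": 0.90}                          # probability of keeping the current die
EMIT = {"F": {r: 1 / 6 for r in "123456"},             # fair die: uniform
        "L": {**{r: 0.1 for r in "12345"}, "6": 0.5}}  # loaded die: a six half of the time

def roll_sequence(n, state="F", seed=None):
    rng = random.Random(seed)
    rolls, states = [], []
    for _ in range(n):
        symbols = list(EMIT[state])
        rolls.append(rng.choices(symbols, weights=[EMIT[state][s] for s in symbols])[0])
        states.append(state)
        if rng.random() > STAY[state]:                  # occasionally switch Fair <-> Loaded
            state = "L" if state == "F" else "F"
    return "".join(rolls), "".join(states)

rolls, states = roll_sequence(12, seed=1)
print(rolls)    # the observer only sees the rolls ...
print(states)   # ... the Fair/Loaded states stay hidden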


Why hidden?

  • Model generates numbers

    • 312453666641

  • Does not tell which die was used

  • Alignment (decoding) can give the most probable solution/path (Viterbi)

    • FFFFFFLLLLLL

  • Or most probable set of states

    • FFFFFFLLLLLL

[Same unfair casino model as above: Fair and Loaded states with the transition and emission probabilities shown earlier.]


HMM (a simple example)

Core of alignment:

ACA---ATG
TCAACTATC
ACAC--AGC
AGA---ATC
ACCG--ATC

Example from A. Krogh

Core region defines the number of states in the HMM (red)

Insertion and deletion statistics are derived from the non-core part of the alignment (black)


HMM construction

  • 5 matches. A, 2xC, T, G

  • 5 transitions in gap region

    • C out, G out

    • A-C, C-T, T out

    • Out transition 3/5

    • Stay transition 2/5

ACA---ATG
TCAACTATC
ACAC--AGC
AGA---ATC
ACCG--ATC

[HMM state diagram built from the alignment: a chain of match states with position-specific emission probabilities for A, C, G, T (e.g. 0.8/0.2 splits), an insert state for the gap region with stay probability 0.4 and leave probability 0.6, and transitions estimated from the counts above.]
ACA---ATG 0.8x1x0.8x1x0.8x0.4x1x1x0.8x1x0.2 = 3.3x10^-2
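As a quick sanity check, the product above can be reproduced directly (a minimal Python sketch; the factors are copied from the slide's worked example):

import math

# Emission and transition probabilities along the path for ACA---ATG, as listed above
factors = [0.8, 1, 0.8, 1, 0.8, 0.4, 1, 1, 0.8, 1, 0.2]
p = math.prod(factors)
print(f"P(ACA---ATG | model) = {p:.4f}")   # 0.0328, i.e. roughly 3.3x10^-2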


Align sequence to HMM

ACA---ATG 0.8x1x0.8x1x0.8x0.4x1x0.8x1x0.2 = 3.3x10^-2

TCAACTATC 0.2x1x0.8x1x0.8x0.6x0.2x0.4x0.4x0.4x0.2x0.6x1x1x0.8x1x0.8 = 0.0075x10^-2

ACAC--AGC = 1.2x10^-2

Consensus:

ACAC--ATC = 4.7x10^-2, ACA---ATC = 13.1x10^-2

Exceptional:

TGCT--AGG = 0.0023x10^-2


Align sequence to HMM: the null model

Score depends strongly on length

The null model is a random model. For length L the score is 0.25^L

Log-odds score for sequence S

log( P(S) / 0.25^L )

A positive score means the sequence is more likely under the model than under the null model

ACA---ATG = 4.9

TCAACTATC = 3.0

ACAC--AGC = 5.3

AGA---ATC = 4.9

ACCG--ATC = 4.6

Consensus:

ACAC--ATC = 6.7

ACA---ATC = 6.3

Exceptional:

TGCT--AGG = -0.97
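The log-odds scores above can be reproduced in a few lines (a minimal sketch; it assumes natural logarithms and that L counts only the non-gap symbols, which matches the numbers on the slide):

import math

def log_odds(p_model, sequence):
    """log( P(S) / 0.25^L ) with L = number of non-gap symbols in the sequence."""
    L = sum(c != "-" for c in sequence)
    return math.log(p_model / 0.25 ** L)

print(round(log_odds(3.3e-2, "ACA---ATG"), 1))   # about 4.9
print(round(log_odds(1.2e-2, "ACAC--AGC"), 1))   # about 5.3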



Model decoding (Viterbi)

Example: 1245666. What was the series of dice used to generate this output?

Log model (base-10 logs of the probabilities above):

[State diagram: Fair emits 1-6 each with log score -0.78; Loaded emits 1-5 with -1 each and 6 with -0.3. Transitions: stay Fair -0.02, stay Loaded -0.05, Fair→Loaded -1.3, Loaded→Fair -1.]


Dynamic programming: computation of scores

[Alignment matrix: the sequence T C G C A runs along the top and T C C A down the side; an x marks the cell currently being filled.]

Any given point in the matrix can only be reached from three possible positions (you cannot "align backwards").

=> The best-scoring alignment ending at any given point in the matrix can be found by choosing the highest scoring of the three possibilities.

Each new score is found by choosing the maximum of three possibilities. For each square in the matrix, keep track of where the best score came from.

Fill in the scores one row at a time, starting in the upper left corner of the matrix and ending in the lower right corner.

score(x,y) = max of:
    score(x-1,y-1) + substitution-score(x,y)
    score(x-1,y) - gap-penalty
    score(x,y-1) - gap-penalty


Model decoding (Viterbi), continued

Filling in the dynamic-programming matrix step by step for the output 1245666, using the log model above, and tracing back from the best-scoring end state identifies the series of dice that was used: FFFFLLL.
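A minimal Viterbi sketch in Python (not from the slides): it uses the base-10 log scores of the log model above and assumes the chain starts in the Fair state; with those assumptions it recovers the path FFFFLLL for the rolls 1245666.

# Log10 scores from the slide's log model
EMIT = {"F": {r: -0.78 for r in "123456"},
        "L": {**{r: -1.0 for r in "12345"}, "6": -0.3}}
TRANS = {("F", "F"): -0.02, ("F", "L"): -1.3,
         ("L", "L"): -0.05, ("L", "F"): -1.0}

def viterbi(rolls, start="F"):
    # score[s]: best log score of a path ending in state s; back[t][s]: its predecessor
    score = {s: (EMIT[s][rolls[0]] if s == start else float("-inf")) for s in "FL"}
    back = []
    for r in rolls[1:]:
        prev, score, choice = score, {}, {}
        for s in "FL":
            best = max("FL", key=lambda q: prev[q] + TRANS[(q, s)])
            choice[s] = best
            score[s] = prev[best] + TRANS[(best, s)] + EMIT[s][r]
        back.append(choice)
    state = max("FL", key=lambda s: score[s])     # best final state
    path = [state]
    for choice in reversed(back):                 # trace back through the stored pointers
        state = choice[state]
        path.append(state)
    return "".join(reversed(path))

print(viterbi("1245666"))   # FFFFLLL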


HMMs and weight matrices

  • In the case of ungapped alignments, HMMs become simple weight matrices


HMM construction

[HMM state diagram from the earlier construction, with the insert-state branch crossed out (X on the slide).]

HMM construction

[The same HMM drawn without the insert state: a chain of match states with emission probabilities 0.8/0.2 etc. and all transitions equal to 1.]
ACA---ATG: score = 0.8x1x0.8x1x0.8x1x1x1x0.8x1x0.2 = 3.3x10^-2, or

log-score = log(0.8)+log(0.8)+log(0.8)+log(1)+log(0.8)+log(0.2)


HMMs and weight matrices

  • In the case of ungapped alignments, HMMs become simple weight matrices

  • To achieve high performance, the emission frequencies are estimated using the techniques of

    • Sequence weighting

    • Pseudo counts (a small sketch follows below)
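A minimal illustration of the pseudo-count idea (Python; the pseudo-count weight and background frequencies are illustrative assumptions, not values from the slides):

# Pseudo-count corrected emission frequencies for a single match state (nucleotide alphabet)
counts = {"A": 4, "C": 0, "G": 1, "T": 0}            # observed counts in one alignment column
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
beta = 1.0                                            # pseudo-count weight (illustrative)

n = sum(counts.values())
freqs = {a: (counts[a] + beta * background[a]) / (n + beta) for a in counts}
print(freqs)   # letters never seen in the column still get a small, non-zero emission probability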


HMMs: what are they good for?

  • Weight matrices do not deal with insertions and deletions

  • In alignments, this is done in an ad-hoc manner by optimization of the two gap penalties for first gap and gap extension

  • An HMM is a natural framework where insertions/deletions are dealt with explicitly


Profile HMMs

  • Alignments based on conventional scoring matrices (BLOSUM62) score all positions in a sequence in an equal manner

  • Some positions are highly conserved, some are highly variable (more than what is described in the BLOSUM matrix)

  • Profile HMMs are ideally suited to describe such position-specific variations


What goes wrong when BLAST fails?

  • Conventional sequence alignment uses a (BLOSUM) scoring matrix to identify amino acid matches between the two protein sequences


Alignment scoring matrices

  • Blosum62 score matrix. Fg=1. Ng=0?


Alignment scoring matrices

  • Blosum62 score matrix. Fg=1. Ng=0?

  • Score = 2+6+6+4-1 = 17

LAGDS

I-GDS
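The score can be reproduced term by term (a minimal Python sketch; the BLOSUM62 values used for these residue pairs are the standard ones, and the single gap is charged the first-gap penalty of 1 from the slide):

# Score the alignment LAGDS / I-GDS with BLOSUM62 and a first-gap penalty of 1
blosum62 = {("L", "I"): 2, ("G", "G"): 6, ("D", "D"): 6, ("S", "S"): 4}
pairs = [("L", "I"), ("A", "-"), ("G", "G"), ("D", "D"), ("S", "S")]
score = sum(-1 if "-" in pair else blosum62[pair] for pair in pairs)
print(score)   # 2 - 1 + 6 + 6 + 4 = 17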


What goes wrong when BLAST fails?

  • Conventional sequence alignment uses a (BLOSUM) scoring matrix to identify amino acid matches between the two protein sequences

  • This scoring matrix is identical at all positions in the protein sequence!

EVVFIGDSLVQLMHQC
AGDS.GGGDS

[Slide figure: the pattern aligned against the sequence, with X marks at several positions.]


When BLAST works!

1PLC._

1PLB._


When BLAST fails!

1PLC._

1PMY._


Sequence profiles

  • In reality not all positions in a protein are equally likely to mutate

    • Some amino acids (active sites) are highly conserved, and the score for a mismatch must be very high

    • Other amino acids can mutate almost for free, and the score for a mismatch should be lower than the BLOSUM score

  • Sequence profiles can capture these differences


Profile HMMs

[Profile HMM figure with labelled regions: conserved positions ("must have a G"), non-conserved positions ("anything can match"), insertions, and deletions.]

ADDGSLAFVPSEF--SISPGEKIVFKNNAGFPHNIVFDEDSIPSGVDASKISMSEEDLLN
TVNGAI--PGPLIAERLKEGQNVRVTNTLDEDTSIHWHGLLVPFGMDGVPGVSFPG---I
-TSMAPAFGVQEFYRTVKQGDEVTVTIT-----NIDQIED-VSHGFVVVNHGVSME---I
IE--KMKYLTPEVFYTIKAGETVYWVNGEVMPHNVAFKKGIV--GEDAFRGEMMTKD---
-TSVAPSFSQPSF-LTVKEGDEVTVIVTNLDE------IDDLTHGFTMGNHGVAME---V
ASAETMVFEPDFLVLEIGPGDRVRFVPTHK-SHNAATIDGMVPEGVEGFKSRINDE----
TKAVVLTFNTSVEICLVMQGTSIV----AAESHPLHLHGFNFPSNFNLVDPMERNTAGVP
TVNGQ--FPGPRLAGVAREGDQVLVKVVNHVAENITIHWHGVQLGTGWADGPAYVTQCPI

Core: positions with < 2 gaps


HMM vs. alignment

  • Detailed description of core

    • Conserved/variable positions

  • Price for insertions/deletions varies at different locations in sequence

  • These features cannot be captured in conventional alignments


Profile-profile scoring matrix

1K7C.A

1WAB._


Profile HMMs

All M/D pairs must be visited once

L1-Y2A3V4R5-I6

P1D2P3P4I4P5D6P7


Example: sequence profiles

  • Alignment of protein sequences 1PLC._ and 1GYC.A

  • E-value > 1000

  • Profile alignment

    • Align 1PLC._ against Swiss-Prot

    • Make a position-specific weight matrix from the alignment

    • Use this matrix to align 1PLC._ against 1GYC.A

  • E-value < 10^-22. RMSD = 3.3 Å


Example continued

Score = 97.1 bits (241), Expect = 9e-22

Identities = 13/107 (12%), Positives = 27/107 (25%), Gaps = 17/107 (15%)

Query: 3 ADDGSLAFVPSEFSISPGEKI------VFKNNAGFPHNIVFDEDSIPSGVDASKIS 56

F + G++ N+ + +G + +

Sbjct: 26 ------VFPSPLITGKKGDRFQLNVVDTLTNHTMLKSTSIHWHGFFQAGTNWADGP 79

Query: 57 MSEEDLLNAKGETFEVAL---SNKGEYSFYCSP--HQGAGMVGKVTV 98

A G +F G + ++ G+ G V

Sbjct: 80 AFVNQCPIASGHSFLYDFHVPDQAGTFWYHSHLSTQYCDGLRGPFVV 126

[Structural superposition: model in red, structure in blue. RMSD = 3.3 Å.]


HMMs: what are they good for? II

  • Transmembrane helix proteins


HMMs: what are they good for? II

  • Transmembrane helix proteins

TMHMM. A. Krogh, 2001


Gene Finding


HMM packages

  • HMMER (http://hmmer.wustl.edu/)

    • S.R. Eddy, WashU St. Louis. Freely available.

  • SAM (http://www.cse.ucsc.edu/research/compbio/sam.html)

    • R. Hughey, K. Karplus, A. Krogh, D. Haussler and others, UC Santa Cruz. Freely available to academia, nominal license fee for commercial users.

  • META-MEME (http://metameme.sdsc.edu/)

    • William Noble Grundy, UC San Diego. Freely available. Combines features of PSSM search and profile HMM search.

  • NET-ID, HMMpro (http://www.netid.com/html/hmmpro.html)

    • Freely available to academia, nominal license fee for commercial users.

    • Allows HMM architecture construction.

  • EasyGibbs (http://www.cbs.dtu.dk/biotools/EasyGibbs/)

    • Webserver for Gibbs sampling of protein sequences

