Hidden Markov Models: What are they good for?

Morten Nielsen

CBS

Absolutely nothing!

- Introduce Hidden Markov models and understand that they are just weight matrices with gaps
- See the beauty of sequence profiles
- Position specific scoring matrices (PSSMs)

- Understand what biological problems are best described using HMM’s
- And which are not!

What is an HMM

What are they good for?

How to construct an HMM

How to “score” a sequence to an HMM

Viterbi decoding

HMM’s that made a difference

Profile HMMs

TMHMM

Links to HMM packages

- A model with no memory
- What I decide depends only on “state” now, not on what I have learned in the past
- No dependence on i-1, i-2 …

- No memory
- Model generates numbers
- 312453666641

The unfair casino: Loaded dice p(6) = 0.5; switch fair to load:0.05; switch load to fair: 0.1

Fair die: 1:1/6, 2:1/6, 3:1/6, 4:1/6, 5:1/6, 6:1/6. Stay fair: 0.95; switch to loaded: 0.05.

Loaded die: 1:1/10, 2:1/10, 3:1/10, 4:1/10, 5:1/10, 6:1/2. Stay loaded: 0.9; switch to fair: 0.10.

- Model generates numbers
- 312453666641

- Does not tell which die was used
- Alignment (decoding) can give the most probable solution/path (Viterbi)
- FFFFFFLLLLLL

- Or the most probable state at each position (posterior decoding)
- FFFFFFLLLLLL
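
To make "the model generates numbers but does not tell which die was used" concrete, here is a minimal Python sketch that samples rolls from the unfair casino. The probabilities are the ones given on the slide; the function name `roll_sequence` and the choice to start with the fair die are assumptions made for illustration.

```python
import random

# Parameters copied from the unfair-casino slide.
EMIT = {
    "F": [1 / 6] * 6,          # fair die: faces 1..6 equally likely
    "L": [0.1] * 5 + [0.5],    # loaded die: p(6) = 0.5
}
SWITCH = {"F": 0.05, "L": 0.10}  # probability of changing die after a roll


def roll_sequence(n, start="F", rng=None):
    """Return (rolls, states), two strings of length n.

    `start` (which die is used first) is an assumption; the slide
    does not specify it.
    """
    rng = rng or random.Random()
    state, rolls, states = start, [], []
    for _ in range(n):
        face = rng.choices(range(1, 7), weights=EMIT[state])[0]
        rolls.append(str(face))
        states.append(state)
        if rng.random() < SWITCH[state]:
            state = "L" if state == "F" else "F"
    return "".join(rolls), "".join(states)


rolls, hidden = roll_sequence(12)
print(rolls)   # what the player sees
print(hidden)  # the hidden states that decoding tries to recover
```

Only the first printed line is observable in the casino; the second is the hidden path.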


ACA---ATG

TCAACTATC

ACAC--AGC

AGA---ATC

ACCG--ATC

Example from A. Krogh

Core region defines the number of states in the HMM (red)

Insertion and deletion statistics are derived from the non-core part of the alignment (black)

Core of alignment

- 5 emissions in the gap region: A, 2xC, T, G
- 5 transitions in the gap region
- C out, G out
- A-C, C-T, T out
- Out transition 3/5
- Stay transition 2/5
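
The counts above can be reproduced with a short Python sketch. It assumes that alignment columns 4-6 form the gap (insert) region, which is how the example treats them; the names `INSERT_COLS` and `insert_statistics` are hypothetical.

```python
from collections import Counter

ALIGNMENT = ["ACA---ATG", "TCAACTATC", "ACAC--AGC", "AGA---ATC", "ACCG--ATC"]
INSERT_COLS = range(3, 6)  # assumption: columns 4-6 (0-based 3..5) are the gap region


def insert_statistics(alignment, insert_cols):
    """Return (emission frequencies, stay probability, out probability)."""
    emissions, stay, out = Counter(), 0, 0
    for seq in alignment:
        inserted = [seq[c] for c in insert_cols if seq[c] != "-"]
        emissions.update(inserted)
        if inserted:
            stay += len(inserted) - 1  # each extra residue is one "stay" transition
            out += 1                   # the last residue leaves the insert state
    total = sum(emissions.values())
    emit_freq = {base: n / total for base, n in sorted(emissions.items())}
    return emit_freq, stay / (stay + out), out / (stay + out)


emit_freq, p_stay, p_out = insert_statistics(ALIGNMENT, INSERT_COLS)
print(emit_freq)      # A, C, G, T frequencies in the insert state
print(p_stay, p_out)  # 2/5 and 3/5
```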

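Profile HMM built from the alignment (emission probabilities per state):

Match 1: A .8, T .2
Match 2: C .8, G .2
Match 3: A .8, C .2
Insert: A .2, C .4, G .2, T .2
Match 4: A 1.
Match 5: G .2, T .8
Match 6: C .8, G .2

Transitions: Match 3 to Insert .6; Match 3 to Match 4 .4; Insert to Insert .4; Insert to Match 4 .6; all other transitions 1.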

ACA---ATG 0.8x1x0.8x1x0.8x0.4x1x1x0.8x1x0.2 = 3.3x10^-2

TCAACTATC 0.2x1x0.8x1x0.8x0.6x0.2x0.4x0.4x0.4x0.2x0.6x1x1x0.8x1x0.8 = 0.0075x10^-2

ACAC--AGC = 1.2x10^-2

Consensus:

ACAC--ATC = 4.7x10^-2, ACA---ATC = 13.1x10^-2

Exceptional:

TGCT--AGG = 0.0023x10^-2
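
The products above can be checked with a small Python sketch. The emission and transition values are read off the alignment and the products on the slide; the function name `path_probability` and the fixed layout (three match columns, three insert columns, three match columns) are assumptions that only fit this example.

```python
MATCH = [  # emission probabilities of the six match states
    {"A": 0.8, "T": 0.2},
    {"C": 0.8, "G": 0.2},
    {"A": 0.8, "C": 0.2},
    {"A": 1.0},
    {"G": 0.2, "T": 0.8},
    {"C": 0.8, "G": 0.2},
]
INSERT = {"A": 0.2, "C": 0.4, "G": 0.2, "T": 0.2}
TO_INSERT, SKIP_INSERT = 0.6, 0.4      # leaving match state 3
STAY_INSERT, LEAVE_INSERT = 0.4, 0.6   # inside the insert state


def path_probability(aligned):
    """Probability of one aligned row such as 'ACA---ATG'."""
    head, inserted, tail = aligned[:3], aligned[3:6].replace("-", ""), aligned[6:]
    p = 1.0
    for state, base in zip(MATCH, head + tail):
        p *= state.get(base, 0.0)  # unseen letters get probability 0 (no pseudo counts)
    if inserted:
        p *= TO_INSERT * STAY_INSERT ** (len(inserted) - 1) * LEAVE_INSERT
        for base in inserted:
            p *= INSERT[base]
    else:
        p *= SKIP_INSERT
    return p  # all remaining transitions have probability 1


for row in ["ACA---ATG", "TCAACTATC", "ACAC--AGC", "ACA---ATC", "TGCT--AGG"]:
    print(row, path_probability(row))
```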

Score depends strongly on length

Null model is a random model. For length L the score is 0.25^L

Log-odds score for sequence S

log( P(S) / 0.25^L )

Positive score means more likely than Null model

ACA---ATG = 4.9

TCAACTATC = 3.0

ACAC--AGC = 5.3

AGA---ATC = 4.9

ACCG--ATC = 4.6

Consensus:

ACAC--ATC = 6.7

ACA---ATC = 6.3

Exceptional:

TGCT--AGG = -0.97
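
A minimal Python sketch of the log-odds calculation. The listed scores are reproduced when the natural logarithm is used, so that is what the sketch does; the function name `log_odds` is hypothetical, and the probabilities passed in are the products from the scoring example.

```python
import math


def log_odds(p_model, length, background=0.25):
    """log( P(S) / background^L ), using the natural logarithm."""
    return math.log(p_model / background ** length)


# (probability under the HMM, number of residues in the sequence)
examples = {
    "ACA---ATG": (0.032768, 6),
    "TCAACTATC": (7.5497472e-5, 9),
    "ACA---ATC": (0.131072, 6),
    "TGCT--AGG": (2.304e-5, 7),
}
for name, (p, length) in examples.items():
    print(name, round(log_odds(p, length), 2))
```

Note that L is the number of residues, not the number of alignment columns: gaps emit nothing and so do not enter the null model.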

Note!

Example: 1245666. What was the series of dice used to generate this output?

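Log model (log10 of the probabilities)

Fair: 1:-0.78, 2:-0.78, 3:-0.78, 4:-0.78, 5:-0.78, 6:-0.78. Stay fair: -0.02; switch to loaded: -1.3.

Loaded: 1:-1, 2:-1, 3:-1, 4:-1, 5:-1, 6:-0.3. Stay loaded: -0.05; switch to fair: -1.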

Alignment matrix: TCGCA along the top, TCCA down the side.

Any given point in matrix can only be reached from three possible positions (you cannot “align backwards”).

=> Best scoring alignment ending in any given point in the matrix can be found by choosing the highest scoring of the three possibilities.

Each new score is found by choosing the maximum of three possibilities. For each square in matrix: keep track of where best score came from.

Fill in scores one row at a time, starting in upper left corner of matrix, ending in lower right corner.

score(x,y) = max of:
score(x,y-1) - gap-penalty
score(x-1,y-1) + substitution-score(x,y)
score(x-1,y) - gap-penalty
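
The recursion can be sketched in Python as follows. The substitution scores (match +1, mismatch -1) and the gap penalty of 1 are assumptions chosen for illustration; the slide does not give them. Only the score is computed, not the traceback.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap_penalty=1):
    """Fill the matrix row by row and return the lower-right score."""
    rows, cols = len(b) + 1, len(a) + 1
    score = [[0] * cols for _ in range(rows)]
    # First row and column: only gaps are possible.
    for x in range(1, cols):
        score[0][x] = -x * gap_penalty
    for y in range(1, rows):
        score[y][0] = -y * gap_penalty
    for y in range(1, rows):
        for x in range(1, cols):
            substitution = match if a[x - 1] == b[y - 1] else mismatch
            score[y][x] = max(
                score[y - 1][x] - gap_penalty,       # score(x, y-1) - gap-penalty
                score[y - 1][x - 1] + substitution,  # score(x-1, y-1) + substitution
                score[y][x - 1] - gap_penalty,       # score(x-1, y) - gap-penalty
            )
    return score[-1][-1]


print(global_alignment_score("TCGCA", "TCCA"))
```

Viterbi decoding of an HMM uses the same idea: each cell keeps the best score among a small number of predecessors, plus a pointer to where it came from.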

Identify which series of dice was used to generate the output 1245666.

Series of dice is FFFFLLL
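
A minimal Viterbi sketch in Python for this example, working in log10 as the log model does. One assumption is needed that the slides do not state: the game starts with the fair die. If both dice were allowed at the first roll with equal probability, the all-loaded path would score higher for this short sequence.

```python
import math

STATES = "FL"
LOG_START = {"F": 0.0, "L": -math.inf}  # assumption: always start with the fair die
LOG_TRANS = {
    ("F", "F"): math.log10(0.95), ("F", "L"): math.log10(0.05),
    ("L", "L"): math.log10(0.90), ("L", "F"): math.log10(0.10),
}
LOG_EMIT = {
    "F": {str(f): math.log10(1 / 6) for f in range(1, 7)},
    "L": {**{str(f): math.log10(0.1) for f in range(1, 6)}, "6": math.log10(0.5)},
}


def viterbi(rolls):
    """Return (most probable state path, its log10 probability)."""
    score = {s: LOG_START[s] + LOG_EMIT[s][rolls[0]] for s in STATES}
    back = []  # one dict of back-pointers per roll after the first
    for roll in rolls[1:]:
        prev, score, pointers = score, {}, {}
        for s in STATES:
            best = max(STATES, key=lambda r: prev[r] + LOG_TRANS[(r, s)])
            pointers[s] = best
            score[s] = prev[best] + LOG_TRANS[(best, s)] + LOG_EMIT[s][roll]
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=score.get)
    best_score = score[state]
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path)), best_score


path, log_p = viterbi("1245666")
print(path, round(log_p, 2))
```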

- In the case of un-gapped alignments HMM’s become simple weight matrices

[Figure: the profile HMM with the insert state crossed out. Without the insert state every transition has probability 1, and only the match-state emission probabilities remain: a weight matrix.]

ACA---ATG sco = 0.8x1x0.8x1x0.8x1x1x1x0.8x1x0.2 = 8.2x10^-2 or

Log-sco = log(0.8)+log(0.8)+log(0.8)+log(1)+log(0.8)+log(0.2)
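
As a sketch of the weight-matrix view, the Python below scores an ungapped six-letter sequence by summing log emission probabilities, one column per position. The natural logarithm is an assumption (the slide only writes "log"), and the names `WEIGHT_MATRIX` and `log_score` are hypothetical.

```python
import math

WEIGHT_MATRIX = [  # match-state emission probabilities, one dict per position
    {"A": 0.8, "T": 0.2},
    {"C": 0.8, "G": 0.2},
    {"A": 0.8, "C": 0.2},
    {"A": 1.0},
    {"G": 0.2, "T": 0.8},
    {"C": 0.8, "G": 0.2},
]


def log_score(seq):
    """Sum of log emission probabilities; -inf if a letter was never observed."""
    total = 0.0
    for column, base in zip(WEIGHT_MATRIX, seq):
        p = column.get(base, 0.0)
        if p == 0.0:
            return -math.inf
        total += math.log(p)
    return total


print(log_score("ACAATG"), math.exp(log_score("ACAATG")))
```

A letter that never occurred in a column gives a score of minus infinity, which is one reason pseudo counts are used when estimating the frequencies.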

- In the case of un-gapped alignments HMM’s become simple weight matrices
- To achieve high performance, the emission frequencies are estimated using the techniques of
- Sequence weighting
- Pseudo counts
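
To illustrate the idea of pseudo counts only, here is a Python sketch using the simplest variant: a constant added to every count (add-one smoothing). This is an assumption for illustration; it is not necessarily the pseudo count scheme meant here, and sequence weighting is not shown. The name `column_frequencies` is hypothetical.

```python
from collections import Counter


def column_frequencies(column, pseudocount=1.0, alphabet="ACGT"):
    """Emission frequencies for one alignment column with add-constant smoothing."""
    residues = [c for c in column if c in alphabet]  # ignore gap characters
    counts = Counter(residues)
    total = len(residues) + pseudocount * len(alphabet)
    return {base: (counts[base] + pseudocount) / total for base in alphabet}


# Column 4 of the core alignment contains only A.
print(column_frequencies("AAAAA"))
```

With the pseudo count, letters never seen in the column get a small non-zero frequency instead of zero.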

- Weight matrices do not deal with insertions and deletions
- In alignments, this is done in an ad hoc manner by optimizing two gap penalties, one for the first gap and one for gap extension
- An HMM is a natural framework where insertions/deletions are dealt with explicitly

- Alignments based on conventional scoring matrices (BLOSUM62) score all positions in a sequence in an equal manner
- Some positions are highly conserved, some are highly variable (more than what is described in the BLOSUM matrix)
- Profile HMM’s are ideally suited to describe such position specific variations

- Conventional sequence alignment uses a (BLOSUM) scoring matrix to identify amino acid matches in the two protein sequences

- BLOSUM62 score matrix. Fg=1. Ng=0?
- Score = 2+6+6+4-1 = 17

LAGDS

I-GDS
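
The score of 17 can be reproduced with a short Python sketch. It assumes that Fg is the penalty for the first gap position and Ng the penalty for each following gap position; only the four BLOSUM62 entries needed for this alignment are included, and the names are hypothetical.

```python
# The four BLOSUM62 entries this alignment needs (not the full matrix).
BLOSUM62_SUBSET = {("L", "I"): 2, ("G", "G"): 6, ("D", "D"): 6, ("S", "S"): 4}


def alignment_score(top, bottom, first_gap=1, next_gap=0):
    """Score a fixed pairwise alignment, column by column."""
    score, in_gap = 0, False
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            score -= next_gap if in_gap else first_gap
            in_gap = True
        else:
            pair = (a, b) if (a, b) in BLOSUM62_SUBSET else (b, a)
            score += BLOSUM62_SUBSET[pair]  # KeyError if the pair is not in the subset
            in_gap = False
    return score


print(alignment_score("LAGDS", "I-GDS"))
```

The same substitution values are applied at every column, which is the limitation that position specific profiles address.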

- Conventional sequence alignment uses a (BLOSUM) scoring matrix to identify amino acid matches in the two protein sequences
- This scoring matrix is identical at all positions in the protein sequence!

EVVFIGDSLVQLMHQC

AGDS.GGGDS

Figure (not reproduced): structures labelled 1PLC._, 1PLB._, 1PLC._ and 1PMY._

- In reality not all positions in a protein are equally likely to mutate
- Some amino acids (active sites) are highly conserved, and the penalty for a mismatch must be very high
- Other amino acids can mutate almost for free, and the penalty for a mismatch should be lower than in the BLOSUM matrix

- Sequence profiles can capture these differences

Alignment annotations: non-conserved, insertion, conserved, deletion, must have a G, anything can match

ADDGSLAFVPSEF--SISPGEKIVFKNNAGFPHNIVFDEDSIPSGVDASKISMSEEDLLN

TVNGAI--PGPLIAERLKEGQNVRVTNTLDEDTSIHWHGLLVPFGMDGVPGVSFPG---I

-TSMAPAFGVQEFYRTVKQGDEVTVTIT-----NIDQIED-VSHGFVVVNHGVSME---I

IE--KMKYLTPEVFYTIKAGETVYWVNGEVMPHNVAFKKGIV--GEDAFRGEMMTKD---

-TSVAPSFSQPSF-LTVKEGDEVTVIVTNLDE------IDDLTHGFTMGNHGVAME---V

ASAETMVFEPDFLVLEIGPGDRVRFVPTHK-SHNAATIDGMVPEGVEGFKSRINDE----

TKAVVLTFNTSVEICLVMQGTSIV----AAESHPLHLHGFNFPSNFNLVDPMERNTAGVP

TVNGQ--FPGPRLAGVAREGDQVLVKVVNHVAENITIHWHGVQLGTGWADGPAYVTQCPI

Core: Position with < 2 gaps

- Detailed description of core
- Conserved/variable positions

- Price for insertions/deletions varies at different locations in sequence
- These features cannot be captured in conventional alignments

1K7C.A

1WAB._

All M/D pairs must be visited once

L1-Y2A3V4R5-I6

P1D2P3P4I4P5D6P7

- Alignment of protein sequences 1PLC._ and 1GYC.A
- E-value > 1000
- Profile alignment
- Align 1PLC._ against Swiss-Prot
- Make position specific weight matrix from alignment
- Use this matrix to align 1PLC._ against 1GYC.A
- E-value < 10^-21. Rmsd = 3.3 Å

Score = 97.1 bits (241), Expect = 9e-22

Identities = 13/107 (12%), Positives = 27/107 (25%), Gaps = 17/107 (15%)

Query: 3 ADDGSLAFVPSEFSISPGEKI------VFKNNAGFPHNIVFDEDSIPSGVDASKIS 56

F + G++ N+ + +G + +

Sbjct: 26 ------VFPSPLITGKKGDRFQLNVVDTLTNHTMLKSTSIHWHGFFQAGTNWADGP 79

Query: 57 MSEEDLLNAKGETFEVAL---SNKGEYSFYCSP--HQGAGMVGKVTV 98

A G +F G + ++ G+ G V

Sbjct: 80 AFVNQCPIASGHSFLYDFHVPDQAGTFWYHSHLSTQYCDGLRGPFVV 126

Rmsd = 3.3 Å (model in red, structure in blue)

- Transmembrane helix proteins

TMHMM. A. Krogh, 2001

- HMMER (http://hmmer.wustl.edu/)
- S.R. Eddy, WashU St. Louis. Freely available.

- SAM (http://www.cse.ucsc.edu/research/compbio/sam.html)
- R. Hughey, K. Karplus, A. Krogh, D. Haussler and others, UC Santa Cruz. Freely available to academia, nominal license fee for commercial users.

- META-MEME (http://metameme.sdsc.edu/)
- William Noble Grundy, UC San Diego. Freely available. Combines features of PSSM search and profile HMM search.

- NET-ID, HMMpro (http://www.netid.com/html/hmmpro.html)
- Freely available to academia, nominal license fee for commercial users.
- Allows HMM architecture construction.

- EasyGibbs (http://www.cbs.dtu.dk/biotools/EasyGibbs/)
- Webserver for Gibbs sampling of protein sequences