HIDDEN MARKOV MODELS IN COMPUTATIONAL BIOLOGY

CS 594: An Introduction to Computational Molecular Biology

BY

Shalini Venkataraman

Vidhya Gunaseelan


Relationship Between DNA, RNA And Proteins

DNA: CCTGAGCCAACTATTGATGAA

(transcription)

mRNA: CCUGAGCCAACUAUUGAUGAA

(translation)

Protein: PEPTIDE
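The DNA, mRNA and protein strings on this slide can be reproduced with a few lines of Python. This is a minimal sketch: the function names are illustrative, and `CODON_TABLE` holds only the seven codons that occur in the slide's example rather than the full genetic code.

```python
# Partial codon table: only the codons needed for the slide's example.
CODON_TABLE = {
    "CCU": "P", "GAG": "E", "CCA": "P", "ACU": "T",
    "AUU": "I", "GAU": "D", "GAA": "E",
}


def transcribe(dna: str) -> str:
    """Write the coding strand as mRNA by replacing T with U."""
    return dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translate an mRNA string codon by codon, ignoring a trailing partial codon."""
    usable = len(mrna) - len(mrna) % 3
    codons = [mrna[i:i + 3] for i in range(0, usable, 3)]
    return "".join(CODON_TABLE[codon] for codon in codons)


dna = "CCTGAGCCAACTATTGATGAA"
mrna = transcribe(dna)
protein = translate(mrna)
print(mrna)     # CCUGAGCCAACUAUUGAUGAA
print(protein)  # PEPTIDE
```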


Protein Structure

Primary Structure of Proteins

The primary structure of peptides and proteins refers to the linear number and order of the amino acids present.


Protein Structure

Secondary Structure

Protein secondary structure refers to regular, repeated patterns of folding of the protein backbone. How a protein folds is largely dictated by the primary sequence of amino acids.

Beta Sheet

Alpha Helix


Multiple Alignment Process

  • Process of aligning three or more sequences with each other

  • Generalization of the algorithm to align two sequences

  • Local multiple alignment uses Sum of pairs scoring scheme
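A sum-of-pairs score adds up the pairwise scores of every pair of characters in every column of the alignment. The sketch below illustrates the idea. The match, mismatch and gap values and the three-sequence toy alignment are assumptions chosen for illustration and do not come from the slides.

```python
from itertools import combinations


def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score one pair of aligned characters (hypothetical scoring values)."""
    if a == "-" and b == "-":
        return 0  # two gaps facing each other carry no penalty here
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def sum_of_pairs(alignment, **scores):
    """Sum pair_score over all pairs of rows in every column."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all aligned rows must have the same length")
    total = 0
    for column in zip(*alignment):
        for a, b in combinations(column, 2):
            total += pair_score(a, b, **scores)
    return total


toy_alignment = ["ACG-T", "ACGAT", "A-GAT"]
print(sum_of_pairs(toy_alignment))  # 3
```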


HMM Architecture

  • Markov Chains

  • What is a Hidden Markov Model (HMM)?

  • Components of HMM

  • Problems of HMMs


Markov Chains

Rain

Sunny

Cloudy

State transition matrix : The probability of the weather given the previous day's weather.

States : Three states - sunny, cloudy, rainy.

Initial Distribution : Defining the probability of the system being in each of the states at time 0.
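The three components above are enough to compute the probability of any weather sequence: multiply the initial probability of the first state by the transition probability of each subsequent step. In the sketch below, all numeric values in `INITIAL` and `TRANSITION` are hypothetical, since the transcript does not preserve the numbers from the slide's figure.

```python
# Hypothetical initial distribution over the three weather states.
INITIAL = {"sunny": 0.5, "cloudy": 0.3, "rainy": 0.2}

# Hypothetical transition matrix: TRANSITION[yesterday][today].
TRANSITION = {
    "sunny":  {"sunny": 0.5,  "cloudy": 0.375, "rainy": 0.125},
    "cloudy": {"sunny": 0.25, "cloudy": 0.125, "rainy": 0.625},
    "rainy":  {"sunny": 0.25, "cloudy": 0.375, "rainy": 0.375},
}


def sequence_probability(states):
    """Probability of observing the given sequence of weather states."""
    probability = INITIAL[states[0]]
    for previous, current in zip(states, states[1:]):
        probability *= TRANSITION[previous][current]
    return probability


print(sequence_probability(["sunny", "sunny", "rainy"]))  # 0.5 * 0.5 * 0.125
```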


Hidden Markov Models

Hidden states : the (TRUE) states of a system that may be described by a Markov process (e.g., the weather).

Observable states : the states of the process that are 'visible' (e.g., seaweed dampness).


Components Of HMM

Output matrix : containing the probability of observing a particular observable state given that the hidden model is in a particular hidden state.

Initial Distribution : contains the probability of the (hidden) model being in a particular hidden state at time t = 1.

State transition matrix : holding the probability of a hidden state given the previous hidden state.
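One way to see how the three components fit together is to use them generatively: draw a hidden state, emit an observation from that state's row of the output matrix, then move to the next hidden state. The sketch below does this for the weather and seaweed example. Every number in it, and the observation names `dry`, `damp` and `soggy`, are assumptions made for illustration.

```python
import random

STATES = ["sunny", "cloudy", "rainy"]      # hidden states
OBSERVATIONS = ["dry", "damp", "soggy"]    # observable seaweed states (assumed names)

initial = [0.5, 0.3, 0.2]                  # hypothetical initial distribution
transition = [                             # hypothetical state transition matrix
    [0.5, 0.375, 0.125],
    [0.25, 0.125, 0.625],
    [0.25, 0.375, 0.375],
]
output = [                                 # hypothetical output matrix
    [0.6, 0.3, 0.1],
    [0.25, 0.5, 0.25],
    [0.05, 0.35, 0.6],
]


def sample(length, seed=0):
    """Generate a hidden state sequence and the observations it emits."""
    rng = random.Random(seed)
    hidden, observed = [], []
    state = rng.choices(range(len(STATES)), weights=initial)[0]
    for _ in range(length):
        hidden.append(STATES[state])
        symbol = rng.choices(range(len(OBSERVATIONS)), weights=output[state])[0]
        observed.append(OBSERVATIONS[symbol])
        state = rng.choices(range(len(STATES)), weights=transition[state])[0]
    return hidden, observed


hidden, observed = sample(5)
print(hidden)
print(observed)
```

In a real application only `observed` would be available, which is what makes the model "hidden".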


Example-HMM

Transition Prob.

Output Prob.

Scoring a Sequence with an HMM:

The probability of ACCY along this path is

0.4 * 0.3 * 0.46 * 0.6 * 0.97 * 0.5 * 0.015 * 0.73 * 0.01 * 1 = 1.76 x 10^-6.
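The product above can be checked directly. The transcript does not say which factors are transition probabilities and which are output probabilities, so the sketch below treats them as a flat list in the order given. It also computes the sum of logs, which is the usual way to avoid numerical underflow on longer sequences.

```python
import math

# Factors along the path, copied from the slide in the order given.
factors = [0.4, 0.3, 0.46, 0.6, 0.97, 0.5, 0.015, 0.73, 0.01, 1.0]

path_probability = math.prod(factors)
log_probability = sum(math.log(f) for f in factors)

print(f"{path_probability:.3g}")   # 1.76e-06
print(log_probability)             # natural log of the same quantity
```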


Problems With HMM

Scoring problem:

Given an existing HMM and an observed sequence, what is the probability that the HMM generates the sequence?
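The scoring problem is commonly solved with the forward algorithm, which sums over all hidden paths without enumerating them. The following is a minimal sketch on the weather and seaweed example. The model numbers and observation names are hypothetical, and the function expects a non-empty list of observation indices.

```python
SYMBOLS = ["dry", "damp", "soggy"]   # assumed observation names

initial = [0.5, 0.3, 0.2]            # hypothetical values throughout
transition = [
    [0.5, 0.375, 0.125],
    [0.25, 0.125, 0.625],
    [0.25, 0.375, 0.375],
]
output = [
    [0.6, 0.3, 0.1],
    [0.25, 0.5, 0.25],
    [0.05, 0.35, 0.6],
]


def forward(observations, initial, transition, output):
    """Return P(observations | model), summed over all hidden state paths."""
    n = len(initial)
    # alpha[s] = P(observations so far, current hidden state = s)
    alpha = [initial[s] * output[s][observations[0]] for s in range(n)]
    for symbol in observations[1:]:
        alpha = [
            sum(alpha[p] * transition[p][s] for p in range(n)) * output[s][symbol]
            for s in range(n)
        ]
    return sum(alpha)


observations = [SYMBOLS.index(name) for name in ["dry", "damp", "soggy"]]
probability = forward(observations, initial, transition, output)
print(probability)
```

The cost grows linearly with the sequence length and quadratically with the number of states, whereas listing every path grows exponentially.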


Problems With HMM

  • Alignment Problem

  • Given a sequence, what is the optimal state sequence that the HMM would use to generate it?


Problems With HMM

Training Problem

Given a large amount of data, how can we estimate the structure and the parameters of the HMM that best account for the data?


HMMs in Biology

  • Gene finding and prediction

  • Protein-Profile Analysis

  • Secondary Structure prediction

  • Advantages

  • Limitations


Finding genes in DNA sequence

This is one of the most challenging and interesting problems in computational biology at the moment. With so many genomes being sequenced so rapidly, it remains important to begin by identifying genes computationally.


What is a (protein-coding) gene?

DNA: CCTGAGCCAACTATTGATGAA

(transcription)

mRNA: CCUGAGCCAACUAUUGAUGAA

(translation)

Protein: PEPTIDE


In more detail

(color ~state)

(Left)

(Removed)


Gene Finding HMMs

  • Our Objective:

    • To find the coding and non-coding regions of an unlabeled string of DNA nucleotides

  • Our Motivation:

    • Assist in the annotation of genomic data produced by genome sequencing methods

    • Gain insight into the mechanisms involved in transcription, splicing and other processes


Why HMMs

  • Classification: Classifying observations within a sequence

  • Order: A DNA sequence is a set of ordered observations

  • Grammar : Our grammatical structure (and the beginnings of our architecture) is right here:

  • Success measure: # of complete exons correctly labeled

  • Training data: Available from various genome annotation projects


HMMs for gene finding

  • Training: Expectation Maximization (EM)

  • Parsing: Viterbi algorithm

An HMM for unspliced genes.

x : non-coding DNA

c : coding state
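Parsing with the Viterbi algorithm means finding the single most probable labeling of each nucleotide as `x` or `c`, which is the alignment problem described earlier. The sketch below runs Viterbi in log space on a two-state model. All probabilities are hypothetical (a GC-rich coding state and a uniform non-coding state are assumed purely for illustration), and the test sequence is made up.

```python
import math

STATES = ("x", "c")  # x: non-coding DNA, c: coding state

# Hypothetical parameters; real gene finders estimate these from annotated data.
INITIAL = {"x": 0.9, "c": 0.1}
TRANSITION = {"x": {"x": 0.9, "c": 0.1}, "c": {"x": 0.1, "c": 0.9}}
OUTPUT = {
    "x": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "c": {"A": 0.10, "C": 0.40, "G": 0.40, "T": 0.10},
}


def viterbi(seq):
    """Return (most probable state labels, log probability of that path)."""
    score = {s: math.log(INITIAL[s]) + math.log(OUTPUT[s][seq[0]]) for s in STATES}
    backpointers = []
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            # Best predecessor state for ending in state s at this position.
            best_prev = max(STATES, key=lambda p: score[p] + math.log(TRANSITION[p][s]))
            pointers[s] = best_prev
            new_score[s] = (score[best_prev]
                            + math.log(TRANSITION[best_prev][s])
                            + math.log(OUTPUT[s][base]))
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final state.
    last = max(STATES, key=lambda s: score[s])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return "".join(reversed(path)), score[last]


labels, log_prob = viterbi("ATATGCGCGCGCATAT")
print(labels)
print(log_prob)
```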


Gene finders: a comparison

Sn = Sensitivity

Sp = Specificity

Ac = Approximate Correlation

ME = Missing Exons

WE = Wrong Exons

GENSCAN Performance Data, http://genes.mit.edu/Accuracy.html
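The nucleotide-level measures in this comparison can be computed from a true labeling and a predicted labeling. The sketch below assumes the conventions widely used in gene-finder benchmarks, where Sn = TP/(TP+FN), Sp = TP/(TP+FP) (note that this differs from the usual statistical definition of specificity), and Ac is twice the average of four conditional probabilities minus one. The label strings are invented, and the function assumes none of the four denominators is zero.

```python
def nucleotide_accuracy(true_labels, predicted_labels, coding="c"):
    """Return (Sn, Sp, Ac) at the nucleotide level for two label strings."""
    tp = fp = tn = fn = 0
    for truth, guess in zip(true_labels, predicted_labels):
        if guess == coding:
            if truth == coding:
                tp += 1
            else:
                fp += 1
        else:
            if truth == coding:
                fn += 1
            else:
                tn += 1
    sn = tp / (tp + fn)   # fraction of true coding bases that were predicted
    sp = tp / (tp + fp)   # fraction of predicted coding bases that are true
    acp = 0.25 * (sn + sp + tn / (tn + fp) + tn / (tn + fn))
    ac = (acp - 0.5) * 2  # approximate correlation, rescaled to [-1, 1]
    return sn, sp, ac


true_labels = "xxcccccxxx"   # made-up annotation
predicted   = "xxxccccccx"   # made-up prediction
sn, sp, ac = nucleotide_accuracy(true_labels, predicted)
print(f"Sn={sn:.3f} Sp={sp:.3f} Ac={ac:.3f}")  # Sn=0.800 Sp=0.667 Ac=0.408
```

ME and WE are exon-level measures and are not covered by this nucleotide-level sketch.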


Protein Profile HMMs

  • Motivation

    • Given a single amino acid target sequence of unknown structure, we want to infer the structure of the resulting protein. Use profile similarity.

  • What is a Profile?

    • Protein families of related sequences and structures

    • Same function

    • Clear evolutionary relationship

    • Patterns of conservation, some positions are more conserved than the others


An Overview

Aligned Sequences

Build a Profile HMM (Training)

Database search

Query against Profile HMM database

(Forward)

Multiple alignments

(Viterbi)


Building – from an existing alignment

ACA - - - ATG

TCA ACT ATC

ACA C - - AGC

AGA - - - ATC

ACC G - - ATC

insertion

Transition probabilities

Output Probabilities

An HMM for a DNA motif alignment. The transitions are shown with arrows whose thickness indicates their probability. In each state, the histogram shows the probabilities of the four bases.
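The output probabilities in such a model can be read off the alignment as column frequencies. The sketch below does this for the five sequences above, with the spaces removed. It makes two assumptions: columns without any gap are treated as match states, and all remaining columns are pooled into a single insert state. No pseudocounts are added, so unseen bases get probability 0.

```python
from collections import Counter

# The slide's alignment with spaces removed.
ALIGNMENT = ["ACA---ATG", "TCAACTATC", "ACAC--AGC", "AGA---ATC", "ACCG--ATC"]


def base_frequencies(characters):
    """Relative frequency of each base, ignoring gap characters."""
    counts = Counter(ch for ch in characters if ch != "-")
    total = sum(counts.values())
    return {base: counts[base] / total for base in "ACGT"}


columns = list(zip(*ALIGNMENT))
# Assumption: a column with no gaps is modeled by a match state.
match_columns = [i for i, col in enumerate(columns) if "-" not in col]
match_emissions = [base_frequencies(columns[i]) for i in match_columns]

# Assumption: every other column feeds one shared insert state.
insert_characters = [ch for i, col in enumerate(columns)
                     if i not in match_columns for ch in col]
insert_emissions = base_frequencies(insert_characters)

for index, emissions in zip(match_columns, match_emissions):
    print(index, emissions)
print("insert", insert_emissions)
```

Under these assumptions the first match state emits A with probability 0.8 and T with probability 0.2, and the insert state emits C with probability 0.4.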


Building – Final Topology

Deletion states

Matching states

Insertion states

Number of match states = average sequence length in the family

PFAM Database- of Protein families

(http://pfam.wustl.edu)


Database Searching

  • Given an HMM, M, for a sequence family, find all members of the family in a database.

  • Log-likelihood (LL) score: LL(x) = log P(x|M)

    • (The LL score is length dependent, so it must be normalized or converted to a Z-score)


Query a new sequence

Suppose I have a query protein sequence and want to know which family it belongs to. Many paths through the model can generate this sequence, so we need to sum the probabilities of all of them.

Consensus sequence: ACAC - - ATC

P(ACACATC) = 0.8 x 1 x 0.8 x 1 x 0.8 x 0.6 x 0.4 x 0.6 x 1 x 1 x 0.8 x 1 x 0.8 = 4.7 x 10^-2
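The probability of the consensus sequence along its path can be recomputed as follows. Splitting the factors into output and transition probabilities is an assumption based on their alternating order in the product. The log-odds score against a uniform background of 0.25 per base is shown as one common way to reduce length dependence; the slides do not specify which normalization they use.

```python
import math

# Assumed split of the slide's factors into outputs and transitions.
emissions = [0.8, 0.8, 0.8, 0.4, 1.0, 0.8, 0.8]
transitions = [1.0, 1.0, 0.6, 0.6, 1.0, 1.0]

path_probability = math.prod(emissions) * math.prod(transitions)
log_likelihood = math.log(path_probability)

# Log-odds against a uniform null model (assumption: 0.25 per base).
sequence_length = len(emissions)
log_odds = log_likelihood - sequence_length * math.log(0.25)

print(f"P = {path_probability:.3g}")       # P = 0.0472
print(f"LL = {log_likelihood:.2f}")
print(f"log-odds = {log_odds:.2f}")
```

This covers only the single consensus path. A full database search score would sum over all paths with the forward algorithm.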


Multiple Alignments

  • Try every possible path through the model that would produce the target sequences

    • Keep the best one and its probability.

    • Output : Sequence of match, insert and delete states

  • Viterbi algorithm (dynamic programming)


Building – unaligned sequences

  • Baum-Welch Expectation-maximization method

    • Start with a model whose length matches the average length of the sequences and with random output and transition probabilities.

    • Align all the sequences to the model.

    • Use the alignment to alter the output and transition probabilities

    • Repeat. Continue until the model stops changing

  • By-product: it produces a multiple alignment
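The loop described above can be sketched with a compact Baum-Welch implementation. To keep it short, this sketch trains a small fully connected HMM rather than a profile HMM with match, insert and delete states, so it illustrates the re-estimation loop but does not produce a multiple alignment. The number of states, the random seed, the stopping tolerance and the unaligned training sequences are all illustrative assumptions.

```python
import math
import random


def forward_backward(obs, pi, A, B):
    """Scaled forward and backward passes; returns (alpha, beta, scales)."""
    n, T = len(pi), len(obs)
    alpha, scales = [], []
    for t in range(T):
        if t == 0:
            row = [pi[i] * B[i][obs[0]] for i in range(n)]
        else:
            row = [sum(alpha[t - 1][i] * A[i][j] for i in range(n)) * B[j][obs[t]]
                   for j in range(n)]
        scales.append(sum(row))
        alpha.append([x / scales[t] for x in row])
    beta = [[1.0] * n for _ in range(T)]
    for t in range(T - 2, -1, -1):
        for i in range(n):
            beta[t][i] = sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                             for j in range(n)) / scales[t + 1]
    return alpha, beta, scales


def baum_welch_step(sequences, pi, A, B):
    """One EM update; also returns the log-likelihood under the old parameters."""
    n, m = len(pi), len(B[0])
    pi_num = [0.0] * n
    A_num = [[0.0] * n for _ in range(n)]
    B_num = [[0.0] * m for _ in range(n)]
    loglik = 0.0
    for obs in sequences:
        alpha, beta, scales = forward_backward(obs, pi, A, B)
        loglik += sum(math.log(c) for c in scales)
        for t in range(len(obs)):
            for i in range(n):
                gamma = alpha[t][i] * beta[t][i]  # P(state i at t | obs)
                if t == 0:
                    pi_num[i] += gamma
                B_num[i][obs[t]] += gamma
                if t + 1 < len(obs):
                    for j in range(n):  # expected i -> j transitions
                        A_num[i][j] += (alpha[t][i] * A[i][j] * B[j][obs[t + 1]]
                                        * beta[t + 1][j] / scales[t + 1])
    new_pi = [x / len(sequences) for x in pi_num]
    new_A = [[x / sum(row) for x in row] for row in A_num]
    new_B = [[x / sum(row) for x in row] for row in B_num]
    return new_pi, new_A, new_B, loglik


def random_distribution(k, rng):
    weights = [rng.random() + 0.1 for _ in range(k)]
    return [w / sum(weights) for w in weights]


def train(sequences, n_states, n_symbols, max_iterations=30, tolerance=1e-6, seed=0):
    """Start from random parameters and repeat EM until the likelihood stalls."""
    rng = random.Random(seed)
    pi = random_distribution(n_states, rng)
    A = [random_distribution(n_states, rng) for _ in range(n_states)]
    B = [random_distribution(n_symbols, rng) for _ in range(n_states)]
    history = []
    for _ in range(max_iterations):
        pi, A, B, loglik = baum_welch_step(sequences, pi, A, B)
        history.append(loglik)
        if len(history) > 1 and history[-1] - history[-2] < tolerance:
            break
    return pi, A, B, history


# Unaligned toy DNA sequences (illustrative), encoded as indices into "ACGT".
raw = ["ACAATG", "TCAACTATC", "ACACAGC", "AGAATC", "ACCGATC"]
sequences = [["ACGT".index(ch) for ch in s] for s in raw]
pi, A, B, history = train(sequences, n_states=3, n_symbols=4)
print(history[0], history[-1])
```

Because EM only finds a local maximum, different seeds can end at different parameter sets, which is why several random restarts are often compared in practice.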


PHMM Example

An alignment of 30 short amino acid sequences chopped out of an alignment of the SH3 domain. The shaded areas are the most conserved and were represented by the main states in the HMM. The unshaded area was represented by an insert state.


Prediction of Protein Secondary structures

  • Prediction of secondary structures is needed for the prediction of protein function.

  • Analyze the amino-acid sequences of proteins

  • Learn secondary structures

    • helix, sheet and turn

  • Predict the secondary structures of sequences


Advantages

  • Characterize an entire family of sequences.

  • Position-dependent character distributions and position-dependent insertion and deletion gap penalties.

  • Built on a formal probabilistic basis

  • Can make libraries of hundreds of profile HMMs and apply them on a large scale (whole genome)


Limitations

  • Markov Chains

    • Probabilities of states are supposed to be independent

    • P(y) must be independent of P(x), and vice versa

    • This usually isn’t true

P(x)

P(y)


Limitations - contd

  • Standard Machine Learning Problems

  • Watch out for local maxima

    • Model may not converge to a truly optimal parameter set for a given training set

  • Avoid over-fitting

    • You’re only as good as your training set

    • More training is not always good


CONCLUSION

  • For links & slides

    • www.evl.uic.edu/shalini/hmm/

