HIDDEN MARKOV MODELS IN COMPUTATIONAL BIOLOGY
CS 594: An Introduction to Computational Molecular Biology
By Shalini Venkataraman and Vidhya Gunaseelan
[Figure: DNA -> (transcription) -> mRNA -> (translation) -> Protein. Example DNA sequence: CCTGAGCCAACTATTGATGAA]
Relationship Between DNA, RNA And Proteins
Primary Structure of Proteins
The primary structure of a peptide or protein is the linear sequence of its amino acids: how many there are and in what order.
Protein secondary structure refers to regular, repeated patterns of folding of the protein backbone. How a protein folds is largely dictated by its primary sequence of amino acids.
State transition matrix : The probability of the weather given the previous day's weather.
States : Three states - sunny, cloudy, rainy.
Initial Distribution : Defining the probability of the system being in each of the states at time 0.
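The three components above can be sketched in a few lines of Python. The slides do not give numbers for the weather example, so every probability below is a made-up illustration, and the function names are hypothetical.

```python
# Hypothetical three-state weather chain; all probabilities are invented
# for illustration and do not come from the slides.
states = ["sunny", "cloudy", "rainy"]

# Initial distribution: probability of each state at time 0.
initial = {"sunny": 0.5, "cloudy": 0.3, "rainy": 0.2}

# transition[prev][cur] = P(today = cur | yesterday = prev); each row sums to 1.
transition = {
    "sunny":  {"sunny": 0.6, "cloudy": 0.3, "rainy": 0.1},
    "cloudy": {"sunny": 0.3, "cloudy": 0.4, "rainy": 0.3},
    "rainy":  {"sunny": 0.2, "cloudy": 0.3, "rainy": 0.5},
}

def sequence_probability(seq):
    """Probability of one particular run of weather, day by day."""
    p = initial[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        p *= transition[prev][cur]
    return p

def step(dist):
    """Advance a distribution over states by one day."""
    return {cur: sum(dist[prev] * transition[prev][cur] for prev in states)
            for cur in states}

p_seq = sequence_probability(["sunny", "sunny", "rainy"])
tomorrow = step(initial)
print(p_seq)
print(tomorrow)
```

Because the next state depends only on the current one, the probability of a whole sequence is just the initial probability times one transition probability per day.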
Hidden states : the (TRUE) states of a system that may be described by a Markov process (e.g., the weather).
Observable states : the states of the process that are 'visible' (e.g., seaweed dampness).
Output matrix : containing the probability of observing a particular observable state given that the hidden model is in a particular hidden state.
Initial Distribution : contains the probability of the (hidden) model being in a particular hidden state at time t = 1.
State transition matrix : holding the probability of a hidden state given the previous hidden state.
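As a rough sketch, the five components listed above map onto plain Python dictionaries. The numbers and the observation labels ("dry", "damp", "soggy") are assumptions made for illustration; the slides only name weather and seaweed dampness as examples.

```python
# Hypothetical weather/seaweed HMM; every number here is invented.
hidden_states = ["sunny", "cloudy", "rainy"]   # hidden states
observables = ["dry", "damp", "soggy"]         # observable states

# Initial distribution over hidden states.
initial = {"sunny": 0.5, "cloudy": 0.3, "rainy": 0.2}

# State transition matrix: transition[prev][cur].
transition = {
    "sunny":  {"sunny": 0.6, "cloudy": 0.3, "rainy": 0.1},
    "cloudy": {"sunny": 0.3, "cloudy": 0.4, "rainy": 0.3},
    "rainy":  {"sunny": 0.2, "cloudy": 0.3, "rainy": 0.5},
}

# Output matrix: emission[hidden][observed].
emission = {
    "sunny":  {"dry": 0.7, "damp": 0.2, "soggy": 0.1},
    "cloudy": {"dry": 0.3, "damp": 0.4, "soggy": 0.3},
    "rainy":  {"dry": 0.1, "damp": 0.3, "soggy": 0.6},
}

def joint_probability(path, observed):
    """P(hidden path AND observations) for one specific hidden path."""
    p = initial[path[0]] * emission[path[0]][observed[0]]
    for t in range(1, len(path)):
        p *= transition[path[t - 1]][path[t]] * emission[path[t]][observed[t]]
    return p

p_joint = joint_probability(["sunny", "rainy"], ["dry", "soggy"])
print(p_joint)
```

Each row of the transition and output matrices is a probability distribution, so it should sum to 1; checking that is a cheap sanity test when entering a model by hand.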
Scoring a Sequence with an HMM:
The probability of ACCY along this path is
.4 * .3 * .46 * .6 * .97 * .5 * .015 * .73 * .01 * 1 = 1.76 x 10^-6.
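The product above can be checked with a short script. The factors are copied from the line above; which of them are transition probabilities and which are emission probabilities cannot be recovered here, because the model figure is missing. Summing logarithms, as in the second calculation, is a common way to avoid underflow on long sequences and is shown as a suggestion rather than as the slides' method.

```python
import math

# Factors copied from the slide, in the order given.
factors = [0.4, 0.3, 0.46, 0.6, 0.97, 0.5, 0.015, 0.73, 0.01, 1.0]

# Direct product of the probabilities along the path.
p = math.prod(factors)

# Same quantity in log space; sums stay well-scaled where products underflow.
log_p = sum(math.log(f) for f in factors)

print(f"{p:.3g}")
print(log_p)
```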
Given an existing HMM and an observed sequence, what is the probability that the HMM generates the sequence?
Given a large amount of data, how can we estimate the structure and the parameters of the HMM that best accounts for the data?
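The first question is usually answered with the forward algorithm, which sums over all hidden paths without enumerating them. The sketch below uses a hypothetical weather/seaweed model with invented numbers and cross-checks the result against explicit enumeration; it does not address the second (training) question.

```python
from itertools import product

# Hypothetical model; every probability is invented for illustration.
hidden = ["sunny", "cloudy", "rainy"]
initial = {"sunny": 0.5, "cloudy": 0.3, "rainy": 0.2}
transition = {
    "sunny":  {"sunny": 0.6, "cloudy": 0.3, "rainy": 0.1},
    "cloudy": {"sunny": 0.3, "cloudy": 0.4, "rainy": 0.3},
    "rainy":  {"sunny": 0.2, "cloudy": 0.3, "rainy": 0.5},
}
emission = {
    "sunny":  {"dry": 0.7, "damp": 0.2, "soggy": 0.1},
    "cloudy": {"dry": 0.3, "damp": 0.4, "soggy": 0.3},
    "rainy":  {"dry": 0.1, "damp": 0.3, "soggy": 0.6},
}

def forward(observed):
    """P(observed | model), computed with the forward recursion."""
    # alpha[s] = P(observations so far, current hidden state = s)
    alpha = {s: initial[s] * emission[s][observed[0]] for s in hidden}
    for obs in observed[1:]:
        alpha = {cur: emission[cur][obs]
                      * sum(alpha[prev] * transition[prev][cur] for prev in hidden)
                 for cur in hidden}
    return sum(alpha.values())

def brute_force(observed):
    """Same quantity by summing over every hidden path (exponential cost)."""
    total = 0.0
    for path in product(hidden, repeat=len(observed)):
        p = initial[path[0]] * emission[path[0]][observed[0]]
        for t in range(1, len(observed)):
            p *= transition[path[t - 1]][path[t]] * emission[path[t]][observed[t]]
        total += p
    return total

observed = ["dry", "damp", "soggy"]
likelihood = forward(observed)
print(likelihood, brute_force(observed))
```

The recursion costs time proportional to (sequence length) x (number of states)^2, whereas enumeration grows exponentially with sequence length.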
This is one of the most challenging and interesting problems in computational biology at the moment. With so many genomes being sequenced so rapidly, it remains important to begin by identifying genes computationally.
What is a (protein-coding) gene?
An HMM for unspliced genes.
x : non-coding DNA
c : coding state
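To illustrate how such a model can label a sequence, here is a minimal sketch that reduces the gene model to just the two states x and c and finds the most probable state path with the Viterbi algorithm. The slides do not give the model's parameters or name the decoding method, so the probabilities below are invented and Viterbi is an assumed choice; a realistic gene model would also handle start and stop codons.

```python
import math

states = ["x", "c"]  # x: non-coding DNA, c: coding state

# Hypothetical parameters, chosen only so the example has something to find.
start = {"x": 0.9, "c": 0.1}
trans = {"x": {"x": 0.9, "c": 0.1}, "c": {"x": 0.1, "c": 0.9}}
emit = {
    "x": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
    "c": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
}

def viterbi(seq):
    """Return the most probable state label (x or c) for each base."""
    # best[s] = log-probability of the best path so far that ends in state s
    best = {s: math.log(start[s]) + math.log(emit[s][seq[0]]) for s in states}
    back = []  # back[t][s] = best predecessor of state s at position t + 1
    for base in seq[1:]:
        pointers, new_best = {}, {}
        for s in states:
            prev = max(states, key=lambda r: best[r] + math.log(trans[r][s]))
            pointers[s] = prev
            new_best[s] = (best[prev] + math.log(trans[prev][s])
                           + math.log(emit[s][base]))
        back.append(pointers)
        best = new_best
    # Trace back from the best final state.
    path = [max(states, key=lambda s: best[s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return "".join(reversed(path))

sequence = "ATATATTA" + "GCGCGGCCGCGCGGCC" + "ATTATAAT"
labels = viterbi(sequence)
print(sequence)
print(labels)
```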
Sn = Sensitivity
Sp = Specificity
Ac = Approximate Correlation
ME = Missing Exons
WE = Wrong Exons
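The slides list these measures without defining them. The sketch below uses the definitions that are conventional in gene-prediction benchmarks, which is an assumption about what the slides intend: Sn = TP/(TP+FN), Sp = TP/(TP+FP) (note that this "specificity" is what other fields call precision), and Ac as the rescaled average of four conditional probabilities. ME and WE are taken as the fraction of annotated exons with no overlapping prediction and the fraction of predicted exons with no overlapping annotation. Function names and the example counts are hypothetical.

```python
def nucleotide_accuracy(tp, fp, tn, fn):
    """Sn, Sp and Ac from per-nucleotide coding/non-coding counts."""
    sn = tp / (tp + fn)   # fraction of true coding bases that were predicted
    sp = tp / (tp + fp)   # fraction of predicted coding bases that are true
    acp = 0.25 * (tp / (tp + fn) + tp / (tp + fp)
                  + tn / (tn + fp) + tn / (tn + fn))
    ac = (acp - 0.5) * 2  # rescaled to the range -1 .. 1
    return sn, sp, ac

def overlaps(a, b):
    """True if inclusive intervals a = (start, end) and b share any position."""
    return a[0] <= b[1] and b[0] <= a[1]

def exon_errors(predicted, annotated):
    """ME and WE as fractions of annotated and predicted exons respectively."""
    missing = sum(not any(overlaps(a, p) for p in predicted) for a in annotated)
    wrong = sum(not any(overlaps(p, a) for a in annotated) for p in predicted)
    return missing / len(annotated), wrong / len(predicted)

# Made-up counts and coordinates, for illustration only.
sn, sp, ac = nucleotide_accuracy(tp=80, fp=20, tn=880, fn=20)
me, we = exon_errors(predicted=[(1, 10), (50, 60)],
                     annotated=[(5, 12), (100, 120)])
print(sn, sp, ac, me, we)
```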
GENSCAN Performance Data, http://genes.mit.edu/Accuracy.html
Build a Profile HMM (Training)
Query against Profile HMM database
Building – from an existing alignment
ACA---ATG
TCAACTATC
ACAC--AGC
AGA---ATC
ACCG--ATC
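The emission probabilities of a model built from this alignment can be estimated by counting residues per column. The sketch below assumes that columns 1-3 and 7-9 are match columns and that columns 4-6 are pooled into a single insert state; this split is inferred from the worked probability later in the deck, not stated explicitly. No pseudocounts are used, which a real profile HMM would normally add.

```python
from collections import Counter

alignment = [
    "ACA---ATG",
    "TCAACTATC",
    "ACAC--AGC",
    "AGA---ATC",
    "ACCG--ATC",
]

# Assumed split into match and insert columns (0-based indices).
match_columns = [0, 1, 2, 6, 7, 8]
insert_columns = [3, 4, 5]

def frequencies(residues):
    """Relative frequency of each base in a list of residues."""
    counts = Counter(residues)
    return {base: n / len(residues) for base, n in counts.items()}

# One emission distribution per match column, ignoring gaps.
match_emissions = [
    frequencies([row[c] for row in alignment if row[c] != "-"])
    for c in match_columns
]

# A single emission distribution pooled over the whole insert region.
insert_emission = frequencies(
    [row[c] for row in alignment for c in insert_columns if row[c] != "-"]
)

# Fraction of sequences that use the insert state at all.
p_enter_insert = sum(
    any(row[c] != "-" for c in insert_columns) for row in alignment
) / len(alignment)

for col, dist in zip(match_columns, match_emissions):
    print(col + 1, dist)
print("insert", insert_emission, "enter", p_enter_insert)
```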
An HMM for a DNA motif alignment. The transitions are shown with arrows whose thickness indicates their probability. In each state, the histogram shows the probabilities of the four bases.
Building – Final Topology
Number of match states = average sequence length in the family
PFAM: Database of Protein Families
Suppose I have a query protein sequence and want to know which family it belongs to. Many different paths through the model can generate this sequence, so we need to sum the probabilities over all of these paths.
P(ACACATC) = 0.8x1 x 0.8x1 x 0.8x0.6 x 0.4x0.6 x 1x1 x 0.8x1 x 0.8 = 4.7 x 10^-2
ACAC--ATC
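The calculation above pairs each emitted base with the transition that follows it. As a check, the sketch below multiplies the same numbers; the state labels (M1..M6 for match states, I for the insert state) are hypothetical names added for readability.

```python
import math

# (state label, emitted base, emission probability), read off the calculation.
path = [
    ("M1", "A", 0.8),
    ("M2", "C", 0.8),
    ("M3", "A", 0.8),
    ("I",  "C", 0.4),
    ("M4", "A", 1.0),
    ("M5", "T", 0.8),
    ("M6", "C", 0.8),
]

# Transition probability between each consecutive pair of states on the path.
transitions = [1.0, 1.0, 0.6, 0.6, 1.0, 1.0]

p = math.prod(e for _, _, e in path) * math.prod(transitions)
sequence = "".join(base for _, base, _ in path)

print(sequence, f"{p:.2g}")
```

This is the probability of one path only; other paths that emit the same sequence would have to be added to obtain the total probability of the sequence under the model.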
An alignment of 30 short amino acid sequences chopped out of an alignment of the SH3 domain. The shaded areas are the most conserved and were represented by the main states in the HMM. The unshaded area was represented by an insert state.