Rna protein structures
RNA/Protein Structures



RNA structure

  • Stem-loop structure



RNA structure

  • A loop structure

    • A loop between i and j when base at i pairs with base at j

    • Base at i+1 pairs with base at j

    • Or base at i pairs with base at j-1

    • Or a multiple loop



RNA secondary structure

  • Search for minimum free energy

    • Gibbs free energy at 37 degrees (C)

    • Free energy increments of base pairs are counted as stacks of adjacent pairs

      • Successive CGs: -3.3 kcal/mol

      • Unfavorable loop initiation energy to constrain bases in a loop



RNA structure prediction

  • Ad-hoc approach

    • Simply look at a strand and find areas where base pairing can occur

    • Possible to find many locations where folds can occur

    • Prediction should be able to determine the most likely one

      • What should be the criteria ?

  • 1980, Nussinov-Jacobson Algorithm

    • More stable one is the most likely structure

    • Find the fold that forms the greatest number of base pairs (base-pairing lowers the overall energy of the strand, more stable)

    • Checking all possible folds is infeasible -> use dynamic programming
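
The base-pair maximization idea above can be sketched as a small dynamic program. This is a minimal illustration, not the original 1980 implementation: it assumes only Watson-Crick pairs (A-U, G-C), nested pairs, and a minimum hairpin loop of 3 unpaired bases (`min_loop`, an assumed parameter), and it returns only the pair count, not the fold itself.

```python
def nussinov_max_pairs(seq, min_loop=3):
    """Return the maximum number of nested base pairs for an RNA sequence."""
    # Assumption: only Watson-Crick pairs are allowed in this sketch.
    pairs = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
    n = len(seq)
    # best[i][j] = maximum number of pairs within seq[i..j] (inclusive)
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            # Case 1: base j is left unpaired.
            score = best[i][j - 1]
            # Case 2: base j pairs with base k, leaving at least
            # min_loop unpaired bases between k and j.
            for k in range(i, j - min_loop):
                if (seq[k], seq[j]) in pairs:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1] if n else 0
```

For example, `nussinov_max_pairs("GGGAAACCC")` pairs the three G's with the three C's around an AAA loop. The table has O(n²) entries and each takes O(n) work, instead of enumerating every fold.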



Amino Acid

  • General structure of amino acids

    • an amino group

    • a carboxyl group

    • α-carbon bonded to a hydrogen and a side-chain group, R

  • Side chain R determines the identity of particular amino acid

  • R: large white and gray

  • C: black

  • Nitrogen: blue

  • Oxygen: red

  • Hydrogen: white



Protein

  • Protein: polymer consisting of AA’s linked by peptide bonds

    • AA in a polymer is called a residue

  • Folded into 3D structures

  • Structure of protein determines its function

    • Primary structure: linear arrangement of AA’s

      • AA sequence (primary structure) determines 3D structure of a protein, which in turn determines its properties

      • N- and C-terminal

    • Secondary structure: local conformation of short stretches of AAs

    • Tertiary structure: overall 3D structure



Protein Structures



Secondary structure

  • Secondary structures have repetitive interactions resulting from hydrogen bonding between the N-H and C=O (carbonyl) groups of the peptide backbone

  • Conformations of side chains of AA are not part of the secondary structure

  • α-helix



  • β-pleated sheet

    • Parallel/antiparallel

    • 3D form of antiparallel



Secondary structure: domain

  • Part of a chain can fold independently of the folding of other parts

    • Such an independently folded portion of a protein is called a domain (super-secondary structure)

  • βαβ unit

  • αα unit (helix-turn-helix)

  • β-meander

  • Greek key



Domain

  • Larger proteins are modular

    • Their structural units, domains or folds, can be covalently linked to generate multi-domain proteins

    • Domains are not only structurally, but also functionally, discrete units – domain family members are structurally and functionally conserved and recombined in complex ways during evolution

    • Domains can be seen as the units of evolution

    • Novelty in protein function often arises as a result of gain or loss of domains, or by re-shuffling existing domains along sequence

    • For pairs of protein domains with the same 3D fold, precise function is conserved down to ~40% sequence identity (broad functional class is conserved down to ~20%)



Motif

  • A repetitive super-secondary structure is called a motif (or module)

    • The Greek key motif is often found in β-barrel tertiary structures

  • Complement control protein module

  • Immunoglobulin module

  • Fibronectin type I module

  • Growth factor module

  • Kringle module



Motif Representation

  • Motif

    • In multiple alignments of distantly related sequences, highly conserved regions are called motifs, features, signatures or blocks

    • Tends to correspond to core structural and functional elements of the proteins



  • Linked series of β-meanders

  • Greek key pattern

  • Alternating βα units

  • Top and side views (α-helical section is outside)



Secondary structure: conformation

  • Two types of Protein Conformations

    • Fibrous

    • Globular – folds back onto itself to create a spherical shape

  • Schematic diagrams of fibrous and globular proteins

  • Computer-generated model of globular protein



SRC protein

  • Tyrosine kinase

  • Enzyme that puts a phosphate group on a tyrosine AA (phosphorylation)

  • Activates an inactive protein, eventually activates cell-division proteins



Secondary Structure Prediction by PSIPRED

  • Prediction of regions of the protein that form alpha-helix, beta-sheet, or random coil

  • NP_005408

>gi|4885609|ref|NP_005408.1| proto-oncogene tyrosine-protein kinase Src [Homo sapiens]

MGSNKSKPKDASQRRRSLEPAENVHGAGGGAFPASQTPSKPASADGHRGPSAAFAPAAAEPKLFGGFNSS

DTVTSPQRAGPLAGGVTTFVALYDYESRTETDLSFKKGERLQIVNNTEGDWWLAHSLSTGQTGYIPSNYV

APSDSIQAEEWYFGKITRRESERLLLNAENPRGTFLVRESETTKGAYCLSVSDFDNAKGLNVKHYKIRKL

DSGGFYITSRTQFNSLQQLVAYYSKHADGLCHRLTTVCPTSKPQTQGLAKDAWEIPRESLRLEVKLGQGC

FGEVWMGTWNGTTRVAIKTLKPGTMSPEAFLQEAQVMKKLRHEKLVQLYAVVSEEPIYIVTEYMSKGSLL

DFLKGETGKYLRLPQLVDMAAQIASGMAYVERMNYVHRDLRAANILVGENLVCKVADFGLARLIEDNEYT

ARQGAKFPIKWTAPEAALYGRFTIKSDVWSFGILLTELTTKGRVPYPGMVNREVLDQVERGYRMPCPPEC

PESLHDLMCQCWRKEPEERPTFEYLQAFLEDYFTSTEPQYQPGENL



Examining Crystal Structure

  • Cn3D: NCBI structure viewer and modeling tool

  • DeepView: SWISS-PROT

  • JMOL

  • NCBI Structure database

    • Links to NCBI MMDB (Molecular Modeling Database)

    • MMDB contains experimentally verified protein structures

    • SRC – MMDB ID 56157, PDB ID 1FMK

    • View Structure from NCBI Structure database

      • Opens up Cn3D window

      • Click to rotate; Ctrl+click to zoom; Shift+click to move

      • Rendering and coloring menus



Tertiary structure

  • 3D arrangement of all atoms in the molecule

  • Considers arrangement of helical and sheet sections, conformations of side chains, arrangement of atoms of side chains, etc.

  • Experimentally determined by

    • X-ray crystallography – measure diffraction patterns of atoms

    • NMR (Nuclear Magnetic Resonance) spectroscopy – use protein samples in aqueous solution



  • Tertiary structures of α-lactalbumin and myoglobin



Protein families

  • Groups of genes of identical or similar sequence are common

    • Sometimes, repetition of identical sequences is correlated with the synthesis of increased quantities of a gene product

      • e.g., a genome contains multiple copies of ribosomal RNAs

      • Human chromosome 1 has 2000 genes for 5S rRNA (sedimentation coefficient), and chr 13, 14, 15, 21 and 22 have 280 copies of a repeat unit made up of 28S, 5.8S and 18S

    • Amplification of rRNA genes evolved because of heavy demand for rRNA synthesis during cell division

    • These rRNA genes are examples of gene families having identical or near-identical sequences

    • Sequence similarities indicate a common evolutionary origin

    • α- and β-globin families have distinct sequence similarities evolved from a single ancestral globin gene



Protein families and superfamilies

  • Dayhoff classification, 1978

    • Protein families – at least 50 % AA sequence similar (based on physico-chemical AA features)

    • Related proteins with less similarity (35%) belong to a superfamily, may have quite diverse functions

    • α- and β-globins are classified as two separate families, and together with myoglobins form the globin superfamily

    • These families have distinct sequence similarities and evolved from a single ancestral globin gene



Protein family database

  • Pattern or secondary database derived from sequences

    • a pattern may be the most conserved aspects of sequence families

    • The most conserved part may vary between species

    • Use scoring system to account for some variability

    • Position-specific scoring matrix (PSSM) or Profile

      • In contrast to a pairwise alignment, which uses the same weights regardless of position

  • Protein family databases are derived by different analytical techniques

    • But all try to find motifs (conserved regions) considered to reflect shared structural or functional characteristics

    • Three groups: single motifs, multiple motifs, or full domain alignments



Protein family databases

  • Pattern or secondary database derived from sequences



Single Motif Method

  • Regular expression

    • PROSITE

    • PDB 1ivy

      • Carboxypet_Ser_His (PS00560)

      • [LIVF]-x(2)-[LIVSTA]-x-[IVPST]-[GSDNQL]-[SAGV]-[SG]-H-x-[IVAQ]-P-x(3)-[PSA]

        • [] – any of the enclosed symbols

        • x – any residue

        • (3) – number of repeats

  • Fuzzy regular expression

    • Build regular expressions with info on shared biochemical properties of AA

    • Provide flexibility according to AA group clustering
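
A PROSITE-style pattern maps almost directly onto an ordinary regular expression. The sketch below is a simplified translator, not the PROSITE tooling itself: the helper name `prosite_to_regex` is made up for illustration, and it handles only bracket classes, `{...}` exclusions, `x`, and repeat counts such as `x(3)`. Anchors (`<`, `>`) are not handled.

```python
import re

def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.split("-"):
        # Split off an optional repeat count such as x(3) or x(2,4).
        match = re.fullmatch(r"([^()]+)(?:\((\d+(?:,\d+)?)\))?", element)
        if match is None:
            raise ValueError(f"cannot parse element: {element!r}")
        core, repeat = match.groups()
        if core == "x":
            core = "."                       # any residue
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"   # any residue except those listed
        # [ABC] classes and single letters are already valid regex syntax.
        parts.append(core + ("{" + repeat + "}" if repeat else ""))
    return "".join(parts)

PS00560 = ("[LIVF]-x(2)-[LIVSTA]-x-[IVPST]-[GSDNQL]-[SAGV]-[SG]-"
           "H-x-[IVAQ]-P-x(3)-[PSA]")
regex = prosite_to_regex(PS00560)
# Made-up peptide that satisfies the pattern (not a real protein fragment).
hit = re.search(regex, "MKLAALAIGSSHAIPAAAPK")
```

Here `hit.group()` is the matched 17-residue stretch. A fuzzy pattern would widen the bracket classes to whole biochemical groups before translation.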



Multiple motif methods

  • PRINTS

    • Encode multiple motifs (called fingerprints) in ungapped, unweighted local alignments

  • BLOCKS

    • Derived from PROSITE and PRINTS

    • Use the most highly conserved regions in protein families in PROSITE

    • Use motif-finding algorithm to generate a large number of candidate blocks

    • Initially, three conserved AA positions anywhere in the alignment are identified and used as anchors

    • Blocks are iteratively extended and ultimately encoded as ungapped local alignments

    • Graph theory is used to assemble a best set of blocks for a given family

    • Use position specific scoring matrix (PSSM), similar to a profile



Full domain alignment

  • Profiles

    • Use family-based scoring matrix via dynamic programming

    • Has position-specific info on insertions and deletions in the sequence family

  • Hidden Markov Model (HMM)

    • PFAM, SMART, TIGRFAM represent full domain alignments as HMMs

    • PFAM

      • Represents each family as seed alignment, full alignment, and an HMM

      • Seed contains representative members of the family

      • Full alignment contains all members of the family as detected with an HMM constructed from the seed alignment



Hidden Markov Model (HMM)

  • Markov Process

    • Decomposed into successive discrete states

    • e.g., first-order Markov process – a traffic light

    • Process states are not directly observable – spoken sounds vs. physical changes in vocal cords, position of tongue, etc.

  • Profile HMM

    • Discrete states correspond to successive columns of protein multiple sequence alignment

      • Match, insertion, deletion states

    • States have associated symbol emission probability distribution

    • Position-specific gap weight represents transition probability from indel to match



Protein Structure Comparison/Classification



Protein structures

  • Domain

    • Polypeptide chain in a protein folds into a ‘tertiary’ structure

    • One or more compact globular regions called domains

    • The tertiary structure associated with a domain region is also described as a protein fold

  • Multi-domain

    • Proteins whose polypeptide chains fold into several domains

    • Nearly half of the known globular structures are multi-domain, and more than half of these have two domains

  • Automatic structure comparison methods were introduced in the 1970s, shortly after the first crystal structures were stored in the PDB



Reasons for structural comparisons

  • Ligand binding

    • Binding of a ligand or substrate to an active site in a protein induces a structural change that facilitates the reaction being catalyzed at the site or promotes binding of substrates at another site

    • Comparing ligand-bound and unbound structures sheds light on these processes and helps drug design

  • Distant evolutionary relationship

    • Protein structure is more highly conserved than sequence

    • Structure comparison can detect homologs with substantial changes

  • Structural variations in protein families

  • Identification of common structural motifs



Structure comparison algorithms

  • Two main components in structure comparison algorithms

    • Scoring similarities in structural features

    • Optimization strategy maximizing similarities measured

  • Most are based on geometric properties from 3D coordinates

    • Intermolecular method

      • Superpose structures by minimizing the distance between superposed positions

    • Intramolecular method

      • Compare sets of internal distances between positions to identify an alignment maximizing the number of equivalent positions

      • Distance is described by RMSD (Root Mean Square Deviation), the square root of the average squared distance between equivalent atoms
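
The RMSD definition above is short enough to write out directly. This sketch assumes the two structures are already superposed and that the lists are in equivalence order (atom n of one list corresponds to atom n of the other). Finding the optimal superposition is a separate step and is not shown.

```python
import math

def rmsd(coords_a, coords_b):
    """Root mean square deviation between two equally long lists of (x, y, z)."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        # Squared Euclidean distance between one pair of equivalent atoms.
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))
```

For two atoms where one pair coincides and the other is 2 Å apart, the result is sqrt(4/2) = sqrt(2) Å.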



Inter vs. Intra



RMSD



Distant homolog

  • Structure is more conserved than sequences during evolution

  • Structural similarity between distant homologs can be found

    • Pairwise sequence similarity

    • SSAP structural similarity score in parenthesis (0 – 100)



Distant homolog



Structural variations in protein families



Structure comparison algorithms

  • SSAP, 1989

    • Residue level, Intra, Dynamic programming

  • DALI, 1993

    • Residue fragment level, intra, Monte Carlo optimization

  • COMPARER, 1990

    • Multiple element level, both, Dynamic programming



Structure classification

  • Most structure classifications are established at the domain level

    • Thought to be an important evolutionary unit and easier to determine domain boundaries from structural data than from sequence data

  • Criteria for assessing domain regions within a structure

    • The domain possesses a compact globular structure

    • Residues within a domain make more internal contacts than to residues in the rest of polypeptide

    • Secondary structure elements are usually not shared with other regions of the polypeptide

    • There is evidence for existence of this region as an evolutionary unit



CATH classifications



Multi-domain structures



Structure classification hierarchy

  • Class level -- proteins are grouped according to their structural class (composition of residues in α-helical and β-strand conformations)

    • Mainly- α, mainly- β, alternating α- β, α plus β (mainly- α and – β are segregated)

  • Architecture

    • the manner by which secondary structure elements are packed together (arrangement of sec. structures in 3D space)

  • Fold group (topology)

    • Orientation of sec. structures and the connectivity between them

  • Superfamily

  • Family



Hierarchy example



Protein Structure databases

  • PDB

    • Over 20,000 entries deduced from X-ray diffraction, NMR or modeling

    • Massively redundant

      • 1FMK, 1BK5, 2F9C, ..

  • SCOP (Structural Classification of Proteins)

    • Multi-domain protein is split into its constituent domains

    • Known structures are classified according to evolutionary and structural relationship

    • Domains in SCOP are grouped by species and hierarchically classified into families, superfamilies, folds and classes

      • Family level – groups together domains with clear sequence similarities

      • Superfamily – group of domains with structural and functional evidence for their descent from a common evolutionary ancestor

      • Fold – group of domains with the same major secondary structures and the same chain topology

    • Domains identified manually by visually inspecting structures



Protein Structure databases

  • SCOP (cont’d)

    • Proteins in the same superfamily often have the same function

  • CATH (Class, Architecture, Topology, Homology)

    • Homology – clustered domains with 35% sequence identity and shared common ancestry

    • 800 fold families, 10 of which are super-folds

    • 2009 www.cs.uml.edu/~kim/580/08_cath.pdf



Protein Function/Structure Prediction



Protein Function Prediction

  • In the absence of experimental data, the function of a protein is usually inferred from its sequence similarity to a protein of known function

    • The more similar the sequence, the more similar the function is likely to be

    • Not always true

  • Can clues to function be derived directly from 3D structure

  • Definition of function

    • Function can be described at many levels: biochemical, biological processes, pathways, organ level

    • Proteins are annotated at different degrees of functional specificity: ubiquitin-like domain, signaling protein, ..

    • GO (Gene Ontology) scheme



Protein Function Prediction

  • Sequence-based – largely unreliable

  • Profile-based

    • Profiles are constructed from sequences of whole protein families, where families are grouped by 3D structure or function (as in Pfam)

    • Start with sequences matched by an initial search, iteratively pull in more remote homologues

    • More sensitivity than simple sequence comparison because profiles implicitly contain information on which residues within the family are well conserved and which sites are more variable

  • Structure-based

    • Fold-based

      • Proteins sharing similar functions often have similar folds, resulting from descent from a common ancestral protein

      • Sometimes the function of a protein changes during evolution while the fold stays unchanged

      • Thus, a fold match is not always reliable

    • Surface clefts and binding pockets



Structure-based Sequence Alignment

  • It is well known that sequence similarity alone does not always give a correct alignment, and that proteins can have similar structures with no detectable sequence similarity

  • Sequence alignment is augmented by structural alignments

    • COMPASS, HOMSTRAD, PALI, ..



Structure Prediction

  • Still an open problem

  • 1974 Peter Chou and Gerald Fasman

    • Propensity values : likelihood that an AA appears in a structure

      • P(a), P(b) and P(turn)

      • >1 indicates a greater than average chance

    • Frequency values: frequency of an AA being found in a hairpin

      • Four positions in a hairpin turn

    • Accuracy is around 50-60%, but popular due to its foundation for later prediction programs



AA             P(a)  P(b)  P(turn)  f(i)   f(i+1)  f(i+2)  f(i+3)
Alanine        142   83    66       0.060  0.076   0.035   0.058
Arginine       98    93    95       0.070  0.106   0.099   0.085
Asparagine     67    89    95       0.161  0.083   0.191   0.091
Aspartic acid  101   54    146      0.147  0.110   0.179   0.081
Cysteine       70    119   119      0.149  0.050   0.117   0.128
Glutamic acid  151   37    74       0.056  0.060   0.077   0.064
Glutamine      111   110   98       0.074  0.098   0.037   0.098
Glycine        57    75    156      0.102  0.085   0.190   0.152
Histidine      100   87    95       0.140  0.047   0.093   0.054
Isoleucine     108   160   47       0.043  0.034   0.013   0.054
Leucine        121   130   59       0.061  0.025   0.036   0.070
Lysine         114   74    101      0.055  0.115   0.072   0.095
Methionine     145   105   60       0.068  0.082   0.014   0.055
Phenylalanine  113   138   60       0.059  0.041   0.065   0.065
Proline        57    55    152      0.102  0.301   0.034   0.068
Serine         77    75    143      0.120  0.139   0.125   0.106
Threonine      83    119   96       0.086  0.108   0.065   0.079
Tryptophan     108   137   96       0.077  0.013   0.064   0.167
Tyrosine       69    147   114      0.082  0.065   0.114   0.125
Valine         104   170   50       0.062  0.048   0.028   0.053



Chou-Fasman Algorithm

  • Step 1: alpha-helices

    • Find a region of six contiguous residues where at least four have P(a)>103

    • Extend the region until a set of four contiguous residues with P(a)<100 is found

    • If the region’s average P(a)>103, its length is >5, and ∑P(a)> ∑P(b), predict an alpha helix

  • Step 2: beta strands

    • Find a region of five contiguous residues with at least three with P(b)>105

    • Extend the region until a set of four contiguous residues with P(b)<100 is found

    • If region’s average P(b)>105, and ∑P(b)> ∑P(a), beta
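
The nucleation test in Step 1 can be sketched as a sliding window. The P(a) values are copied from the propensity table in these slides, keyed by one-letter code. The function covers only the first bullet of Step 1 (finding six-residue nuclei). Extension and the final averaging checks are left out, and the name `helix_nuclei` is illustrative.

```python
# P(a) helix propensities from the table in the slides, by one-letter code.
P_ALPHA = {
    "A": 142, "R": 98, "N": 67, "D": 101, "C": 70, "E": 151, "Q": 111,
    "G": 57, "H": 100, "I": 108, "L": 121, "K": 114, "M": 145, "F": 113,
    "P": 57, "S": 77, "T": 83, "W": 108, "Y": 69, "V": 104,
}

def helix_nuclei(seq, window=6, min_formers=4, threshold=103):
    """Return start indices of windows that satisfy the nucleation rule."""
    starts = []
    for start in range(len(seq) - window + 1):
        # Count residues in the window whose P(a) exceeds the threshold.
        formers = sum(1 for aa in seq[start:start + window]
                      if P_ALPHA[aa] > threshold)
        if formers >= min_formers:
            starts.append(start)
    return starts
```

For `"GGGGAELKGGGG"` the only windows with four helix formers are those that contain all of A, E, L and K. The beta-strand nucleation in Step 2 has the same shape with P(b), a window of five, and a threshold of 105.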



Chou-Fasman Algorithm

  • Step 3: beta turns

    • For each residue j, determine the turn propensity P(t) as

    • $P(t)_j = f(i)_j \times f(i+1)_{j+1} \times f(i+2)_{j+2} \times f(i+3)_{j+3}$

    • A turn is predicted at position j if P(t) > 0.000075, the average P(turn) from j to j+3 is > 100, and ∑P(a) < ∑P(turn) > ∑P(b)

  • Step 4: overlaps

    • If alpha region overlaps with beta, the region’s ∑P(a) and ∑P(b) determine the most likely structure in the overlapped region

    • If ∑P(a) > ∑P(b) for the overlapping region, alpha

    • If ∑P(a) < ∑P(b) for the overlapping region, beta

    • If ∑P(a) = ∑P(b), no valid call
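
The turn score in Step 3 is a product of four position-specific frequencies. The sketch below computes it for every four-residue window, using the f(i)…f(i+3) columns of the table in these slides. It applies only the first of the three turn conditions (the 0.000075 cutoff). The checks on average P(turn) and on the summed propensities are omitted, and the function names are illustrative.

```python
# (f(i), f(i+1), f(i+2), f(i+3)) per residue, copied from the table in the slides.
TURN_FREQ = {
    "A": (0.060, 0.076, 0.035, 0.058), "R": (0.070, 0.106, 0.099, 0.085),
    "N": (0.161, 0.083, 0.191, 0.091), "D": (0.147, 0.110, 0.179, 0.081),
    "C": (0.149, 0.050, 0.117, 0.128), "E": (0.056, 0.060, 0.077, 0.064),
    "Q": (0.074, 0.098, 0.037, 0.098), "G": (0.102, 0.085, 0.190, 0.152),
    "H": (0.140, 0.047, 0.093, 0.054), "I": (0.043, 0.034, 0.013, 0.054),
    "L": (0.061, 0.025, 0.036, 0.070), "K": (0.055, 0.115, 0.072, 0.095),
    "M": (0.068, 0.082, 0.014, 0.055), "F": (0.059, 0.041, 0.065, 0.065),
    "P": (0.102, 0.301, 0.034, 0.068), "S": (0.120, 0.139, 0.125, 0.106),
    "T": (0.086, 0.108, 0.065, 0.079), "W": (0.077, 0.013, 0.064, 0.167),
    "Y": (0.082, 0.065, 0.114, 0.125), "V": (0.062, 0.048, 0.028, 0.053),
}

def turn_scores(seq):
    """Return P(t) for each start position j with a full four-residue window."""
    scores = []
    for j in range(len(seq) - 3):
        p = 1.0
        for offset in range(4):
            # Residue at j+offset contributes its frequency at turn position offset.
            p *= TURN_FREQ[seq[j + offset]][offset]
        scores.append(p)
    return scores

def candidate_turns(seq, cutoff=0.000075):
    """Positions whose P(t) exceeds the cutoff (first turn condition only)."""
    return [j for j, p in enumerate(turn_scores(seq)) if p > cutoff]
```

A window such as N-P-D-G scores well above the cutoff because each residue is frequent at its turn position, while I-I-I-I falls far below it.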



Secondary structure prediction

  • Chou and Fasman (1974): based on the frequencies of amino acids found in α-helices, β-sheets, and turns.

  • Proline: occurs at turns, but not in α-helices.

  • GOR (Garnier, Osguthorpe, Robson): related algorithm

  • Modern algorithms: use multiple sequence alignments and achieve higher success rate (about 70-75%)

Page 427



Secondary structure prediction

Web servers:

GOR4

Jpred

NNPREDICT

PHD

Predator

PredictProtein

PSIPRED

SAM-T99sec

Table 11-3

Page 429



Markov Model (MM)

  • Examine correlation in sequences

  • In a long sequence, suppose AA a is observed $n_a$ times and that a is followed by b $n_{ab}$ times

    • $r_{ab} = P(x_i = b \mid x_{i-1} = a) = n_{ab} / n_a$

    • First-order Markov model of a sequence is defined by

      • An alphabet

      • A matrix of conditional probs. $r_{ab}$

      • A set of frequencies for initial state, $q_a$

    • Likelihood of a sequence $x_1, x_2, \ldots, x_N$ according to the 1st-order model is

      • $L = q_{x_1} \prod_{i=2}^{N} r_{x_{i-1} x_i}$

    • If no correlation, $r_{ab} = q_b$ and $L = \prod_{i=1}^{N} q_{x_i}$ (zero-order MM)

    • kth-order MM

      • k=2, $r_{abc} = P(x_i = c \mid x_{i-1} = b, x_{i-2} = a) = n_{abc} / n_{ab}$

  • A letter is dependent upon preceding letters
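
A first-order model like the one above can be estimated and scored in a few lines. This is a sketch under one stated assumption: n_a is counted only over positions that have a successor, so that the conditional probabilities for each letter sum to 1. The function names are illustrative, and the likelihood is computed in log space to avoid underflow on long sequences.

```python
import math
from collections import Counter

def fit_first_order(seq):
    """Estimate r_ab = n_ab / n_a from a single training sequence."""
    pair_counts = Counter(zip(seq, seq[1:]))
    # Assumption: count n_a over positions that are followed by another letter.
    first_counts = Counter(seq[:-1])
    return {(a, b): n / first_counts[a] for (a, b), n in pair_counts.items()}

def log_likelihood(seq, q, r):
    """ln L = ln q[x1] + sum over i of ln r[(x_{i-1}, x_i)]."""
    total = math.log(q[seq[0]])
    for a, b in zip(seq, seq[1:]):
        total += math.log(r[(a, b)])
    return total
```

Transitions never seen in training have no entry in `r`, so scoring a sequence that contains them raises `KeyError`. Pseudo-counts would be the usual remedy.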



Fair Bet Casino Problem

  • Dealer uses a fair coin, but occasionally switches to a biased coin

  • Given a sequence of coin tosses, determine when the dealer used a fair/biased coin

  • For n tosses with sequence $x = x_1 x_2 \ldots x_n$, k of which are heads

    • $P(x \mid \text{fair}) = \prod_{i=1}^{n} \tfrac{1}{2} = (1/2)^n$

    • $P(x \mid \text{biased}) = q^k (1-q)^{n-k}$, where q is the probability of heads for the biased coin

    • Log-odds ratio

      • $R = \log_2 [P(x \mid \text{fair}) / P(x \mid \text{biased})]$

    • For q = ¾, $R = n - k \log_2 3$

      • If R<0, biased coin
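
The log-odds test above can be checked numerically. The sketch assumes tosses are given as a string of `"1"` (heads) and `"0"` (tails) and uses base-2 logarithms, for which the closed form with q = ¾ is n − k·log2(3).

```python
import math

def log_odds_fair_vs_biased(tosses, q=0.75):
    """log2 of P(x | fair) / P(x | biased); negative values favour the biased coin."""
    n = len(tosses)
    k = tosses.count("1")                     # number of heads
    log_fair = n * math.log2(0.5)
    log_biased = k * math.log2(q) + (n - k) * math.log2(1 - q)
    return log_fair - log_biased
```

Four heads in a row give 4 − 4·log2(3), which is negative, so the biased coin is the better explanation. Four tails give +4.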



CG Islands

  • CG is the least frequent di-mer sequence

  • C in CG is easily methylated, and methyl-C tends to mutate to T

  • Methylation is often suppressed around genes in CG islands

  • Find CG islands in long DNA sequences

    • Calculate log-odds ratios of a sliding window of a certain length

    • And declare a CG island if the score is positive

  • Disadvantage of the approach

    • Do not have info of the CG island length in advance

    • => use HMM



Hidden Markov Model (HMM)

  • An abstract machine emitting symbols

  • At each discrete step, the HMM makes two decisions

    • What is the next state

    • What symbol to emit

[Figure: two-state HMM with states F and B, each emitting H or T; the transitions between F and B are labeled 0.1]



Coin Toss

  • Given a path P= FFFBBBBBFFF

  • And output x=01011101001

  • P(x|P) = ½ ½ ½ ¾ ¾ ¾ ¼ ¾ ½ ½ ½

  • P(P) = ½ 9/10 9/10 1/10 9/10 9/10 9/10 ….

  • Find a path that maximizes P(x|P)·P(P), the joint probability of the output and the path, over all possible P
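
The two products in the coin-toss example can be reproduced exactly with fractions. The parameters are read off the slide: the fair coin emits 0 or 1 with probability ½, the biased coin emits 1 with probability ¾, the start probability is ½, and the coin switches with probability 1/10. Treating "1" as the ¾ outcome is an interpretation of the slide's product.

```python
from fractions import Fraction

EMIT = {
    "F": {"0": Fraction(1, 2), "1": Fraction(1, 2)},   # fair coin
    "B": {"0": Fraction(1, 4), "1": Fraction(3, 4)},   # biased coin
}
START = Fraction(1, 2)
STAY, SWITCH = Fraction(9, 10), Fraction(1, 10)

def emission_prob(x, path):
    """P(x | path): product of the emission probabilities along the path."""
    p = Fraction(1)
    for symbol, state in zip(x, path):
        p *= EMIT[state][symbol]
    return p

def path_prob(path):
    """P(path): start probability times the transition probabilities."""
    p = START
    for prev, cur in zip(path, path[1:]):
        p *= STAY if prev == cur else SWITCH
    return p

x, path = "01011101001", "FFFBBBBBFFF"
joint = emission_prob(x, path) * path_prob(path)
```

Decoding compares `joint` across all candidate paths. Enumerating them is exponential in the sequence length, which is why dynamic programming is used instead.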



HMM of Loop/Helical

  • A letter depends on preceding letters AND on hidden state

    • In helical/loop problem, two hidden states: 0 for loop, 1 for helical

    • rB0, rB1 : probs. that 1st residue is loop or helical

    • r11, r10 : probs. of remaining in helical, or switching to loop

    • etc.



Hidden Markov Model (HMM)

  • Emission probs. are set equal to the AA frequencies in each state

    • $e_0(a) = p_a^{l}$, $e_1(a) = p_a^{h}$

    • AAs occur independently as long as the model stays in either state 0 or 1 (zero-order)

    • Transitions between hidden states are modeled by a 1st-order process

      • Values of transition probs. (r11, r10 ,…) control the relative frequency and relative lengths of the regions

      • If r01 is very small, difficult to initiate a new helical region, etc.

    • e.g.

    • Sequence $x_i$: GHMESSLLKQTINSWHLN

    • Path $\pi_i$: B001111000011110000E



Hidden Markov Model (HMM)

  • Path variables $\pi_i$ describe the hidden states

  • Likelihood of the path $\pi$ in the example,

    • $L = [r_{B0} e_0(G)] [r_{00} e_0(H)] [r_{01} e_1(M)] \cdots [r_{00} e_0(N)] r_{0E}$

  • Model can be used to determine the most likely positions of helices and loops within the sequence (called decoding problem)

    • Two ways of doing this

      • Viterbi

        • Find the most probable path through the model, i.e., find the sequence of hidden states with the highest L

        • This gives a straightforward prediction that each site is either helix or loop

      • Forward/Backward

        • Consider all possible paths through the model, weighted according to their likelihood, and calculate the prob. that each site is in each of the hidden states



HMM Parameters

  • The most probable paths depend on emission and transition probs.

  • Parameters are determined from known structure info, from which we can calculate ML values of the probs.

  • As in the profile model, ML values of emission frequencies are given by the observed frequencies

    • The simplest way of choosing $p_a$ is to use $n_a / n_{tot}$ (this indeed maximizes L)

  • If AA a occurs $n_{ka}$ times in regions of state k, and the total number of residues in state k is $n_k^{tot}$,

    • $e_k(a) = n_{ka} / n_k^{tot}$

  • If state j follows state k $m_{kj}$ times,

    • $r_{kj} = m_{kj} / n_k^{tot}$ (1st-order model)

  • If some transitions occur very rarely, or not at all, it is best to use prior info in choosing the frequencies (by adding pseudo-counts)
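
Counting-based estimation with pseudo-counts can be sketched as follows, given one sequence and its known state path. Two simplifications are assumptions of this sketch rather than part of the slides: begin and end states are ignored, and each transition row is normalized by the number of transitions leaving state k so that the row sums to 1. The function name and the `pseudocount` parameter are illustrative.

```python
from collections import Counter

def estimate_parameters(sequence, path, alphabet, states, pseudocount=1.0):
    """ML-style estimates of emission and transition probs. from a labeled path."""
    emit_counts = Counter(zip(path, sequence))    # n_ka
    trans_counts = Counter(zip(path, path[1:]))   # m_kj
    emissions, transitions = {}, {}
    for k in states:
        total = sum(emit_counts[(k, a)] + pseudocount for a in alphabet)
        emissions[k] = {a: (emit_counts[(k, a)] + pseudocount) / total
                        for a in alphabet}
        total = sum(trans_counts[(k, j)] + pseudocount for j in states)
        transitions[k] = {j: (trans_counts[(k, j)] + pseudocount) / total
                          for j in states}
    return emissions, transitions
```

With `pseudocount=0` this reduces to plain observed frequencies, and a state that never occurs in the path then causes a division by zero. A positive pseudo-count keeps rare or unseen events at a small nonzero probability.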



    HMM Parameters

    • If parameters are chosen from a set of known examples, this is referred to as supervised learning

    • Unsupervised learning maximizes the likelihood within the framework of the model, but without being told what parameters to learn

      • We can define a model with two hidden states, defined by two sets of frequencies, without specifying the meaning of the two states

      • The learning process will then determine the best way of partitioning the sequence into two different types of subsequences

      • The simplest way of implementing HMM with unsupervised learning is Viterbi training



    Viterbi Algorithm

    • Start with an initial guess as to the model parameters

    • Calculate the values of hidden states on the most probable path for each sequence in the training set

    • From this, calculate $n_{ka}$ and $m_{kj}$ in the most probable path



    Viterbi Algorithm

    • Given sequence $x_i$,

      • $v_k(i)$: likelihood of the most probable path for the first i letters in the sequence, given that the i-th letter is in state k

    • Initialize $v_k(1) = r_{Bk} e_k(x_1)$

    • And $v_k(i) = e_k(x_i) \max_h [v_h(i-1) r_{hk}]$ for i=2,…,N

    • $v_E = \max_h [v_h(N) r_{hE}]$: likelihood of the best total path

    • This is a dynamic programming algorithm.
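
The Viterbi recursion can be sketched in log space, which avoids underflow on long sequences. The argument layout (`start[k]` for r_Bk, `trans[h][k]` for r_hk, `emit[k][symbol]` for e_k, `end[h]` for r_hE) is a choice made for this illustration, and all probabilities are assumed to be nonzero so that their logarithms exist.

```python
import math

def viterbi(seq, states, start, trans, emit, end):
    """Return (most probable state path, its log-likelihood)."""
    # v[k] = log-likelihood of the best path for the prefix so far, ending in k
    v = {k: math.log(start[k]) + math.log(emit[k][seq[0]]) for k in states}
    back = []
    for symbol in seq[1:]:
        pointers, new_v = {}, {}
        for k in states:
            # Best predecessor state h for arriving in state k.
            best_h = max(states, key=lambda h: v[h] + math.log(trans[h][k]))
            pointers[k] = best_h
            new_v[k] = (v[best_h] + math.log(trans[best_h][k])
                        + math.log(emit[k][symbol]))
        back.append(pointers)
        v = new_v
    # Termination: include the transition into the end state.
    last = max(states, key=lambda h: v[h] + math.log(end[h]))
    best_log = v[last] + math.log(end[last])
    # Trace the pointers back from the last state to recover the path.
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return "".join(reversed(path)), best_log
```

With the two-coin model (switch probability 0.1, biased coin emitting "1" with probability ¾), a run of 1s decodes as all B and a run of 0s as all F. The cost is O(N·K²) for N letters and K states.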



    Baum-Welch Algorithm

    • Expectation-maximization, forward/backward

    • Expectation: calculate prob. $P(\pi_i = k)$ that site i is in state k

      • Then, the expected number of times letter a appears in state k, averaged over all possible paths, is $E[n_{ka}] = \sum_{i: x_i = a} P(\pi_i = k)$

      • Also, $E[m_{kj}] = \sum_i P(\pi_i = k, \pi_{i+1} = j)$

    • Maximization: substitute $E[n_{ka}]$ and $E[m_{kj}]$ into $e_k(a) = n_{ka} / n_k^{tot}$, $r_{kj} = m_{kj} / n_k^{tot}$

    • Expectation-maximization is repeated until the parameters no longer change

    • How is $P(\pi_i = k)$ calculated?



    Baum-Welch Algorithm

    • P(i =k) ?

      • Forward

        • fk(1) = rBk ek(x1)

        • fk(i) = ek(xi)∑hfh(i-1) rhk (i=2,…,N)

        • Ltot = ∑hfh(N) rhE

      • Backward: bk(N) – sum of likelihoods of all paths from xi+1 to N

        • bk(N) = rkE

        • bk(i) =∑hrhk eh(xi) bh(i+1) (i=2,…,N)

        • Ltot = ∑hrBh eh(xi) bh(1)

      • Both Ltot has to be identical

      • ∑ P(i =k) = fk(i) bk(i)/ Ltot

      • ∑iP(i =k, i+1 =j) = fk(i) rkjej(xi+1) bj(i+1)/ Ltot
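
The forward and backward recursions can be sketched together. The version below works with plain probabilities, without the scaling or log-space tricks that long sequences need, so it suits short examples only. The argument layout (`start`, `trans[h][k]`, `emit[k][symbol]`, `end`) is an assumption of this illustration.

```python
def forward_backward(seq, states, start, trans, emit, end):
    """Return (per-site posteriors, L_tot from forward, L_tot from backward)."""
    n = len(seq)
    # Forward: f[i][k] covers x_1..x_i (0-based index i) ending in state k.
    f = [{k: start[k] * emit[k][seq[0]] for k in states}]
    for i in range(1, n):
        f.append({k: emit[k][seq[i]] * sum(f[i - 1][h] * trans[h][k]
                                           for h in states)
                  for k in states})
    # Backward: b[i][k] covers everything after site i, given state k at site i.
    b = [None] * n
    b[n - 1] = {k: end[k] for k in states}
    for i in range(n - 2, -1, -1):
        b[i] = {k: sum(trans[k][h] * emit[h][seq[i + 1]] * b[i + 1][h]
                       for h in states)
                for k in states}
    l_forward = sum(f[n - 1][h] * end[h] for h in states)
    l_backward = sum(start[h] * emit[h][seq[0]] * b[0][h] for h in states)
    # Posterior probability that site i is in state k.
    posterior = [{k: f[i][k] * b[i][k] / l_forward for k in states}
                 for i in range(n)]
    return posterior, l_forward, l_backward
```

Two quick sanity checks are that the two totals agree and that the posteriors at each site sum to 1. Baum-Welch would accumulate these posteriors into expected counts and re-estimate the parameters.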



    Helical/Loop Example

    • M1-M0 model

      • 1st-order transition between hidden states, 0-order independent letters

      • Also, ek(a) is dependent on hidden state k, but not on previous letter

      • Example of M1-M0 model

        • Occasionally dishonest casino, Durbin et al (1998)



    Helical/Loop Example

    • HMM model

      • Krogh et al. (2001)



    Coiled Coil Example

    • Coiled coils are associations of two or more α helices that wrap around each other

      • Found in many proteins, tropomyosin, hemagglutinin (influenza virus), DNA-binding transcription factors

      • About 3.5 residues per turn, leading to a repeating pattern of seven residues (a heptad) in two turns

      • Two helices attract one another due to hydrophobic residues at sites a and d



    Coiled Coil Example

    • Lupas, Van Dyke, and Stock (1991)

      • Developed a profile scoring system dependent on the relative amino acid frequencies at each site (similar to $L_{profile} = \prod_{i=1}^{N} p_i(x_i)$ and $S = \ln(L_{profile} / L_0) = \sum_{i=1}^{N} \ln[p_i(x_i) / p(x_i)]$ in the profile model)

      • Used a sliding window of 28 residues (four heptads)

    • Delorenzi and Speed (2002)

      • HMM with 9 groups of states and Beg/End

      • Each of 9 groups contains seven states representing seven possible positions in the helix

        • States in one group are linked to the state at the following helix position in the next group



    Profile HMM

    • Profile technique

      • position-specific scores are used to describe aligned families of protein sequences

      • Drawback is the reliance on ad hoc scoring schemes

    • Profile HMM was developed to capture the info in an alignment

    • 2 3

    • W H . . E n

    • W H . . Y .

    • W - . . E .

    • S H . . E .

    • T H e . Y .

    • W H e r E .



    Neural Networks

    • Simulate the human nervous system

      • Neurons and synapses

      • Neuron puts out a real number between 0 and 1

    • Feedforward network

    • Typically 10-20 residues are input

    • Usually used in supervised learning



    Single Neuron

    • Connection from input i to neuron j has a positive/negative weight $w_{ij}$

      • Total input $x_j = \sum_i w_{ij} y_i$

      • Output $y_j = g(x_j)$

      • A sigmoid function: $g(x_j) = 1 / [1 + \exp(-x_j)]$

    • Single output with multiple inputs is called a perceptron
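
A single sigmoid neuron as described above is only a weighted sum followed by the squashing function. The sketch below takes the inputs and their weights as parallel lists, and the function name is illustrative.

```python
import math

def neuron_output(inputs, weights):
    """Output y_j = g(x_j) for one neuron, with x_j the weighted input sum."""
    total = sum(w * y for w, y in zip(weights, inputs))   # x_j
    return 1.0 / (1.0 + math.exp(-total))                 # sigmoid g(x_j)
```

A total input of 0 gives an output of exactly 0.5. Large positive totals approach 1 and large negative totals approach 0, which keeps the output in the 0 to 1 range.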



    Perceptron Example

    • Two inputs, one output

    • Trained by the four examples below

      • Total input $x_j = w_1 y_1 + w_2 y_2 + w_0$ ($w_0$ is the bias)

      • Assume a step function for $g(x_j)$

    (y1, y2) → y    Constraint on the weights
    (0, ½)   → 1    w2/2 + w0 > 0
    (1, 1)   → 1    w1 + w2 + w0 > 0
    (1, ½)   → 0    w1 + w2/2 + w0 < 0
    (0, 0)   → 0    w0 < 0



    Perceptron Example

    • Visualize

      • Can pick y2 = ¼ + 1/2 y1

        • -1/4 - 1/2 y1 + y2 > 0

      • w1 = -1/2, w2 = 1, w0 = -¼

    w2/2 + w0 > 0

    w1+ w2+ w0 > 0

    w1 + w2/2 + w0 < 0

    w0 < 0
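
The four constraints can be checked directly with a step-function perceptron. The weights below come from reading the separating line y2 = ¼ + ½·y1 as −¼ − ½·y1 + y2 > 0, that is w0 = −¼, w1 = −½, w2 = 1. Any weights satisfying the four inequalities would work equally well.

```python
def perceptron(y1, y2, w1=-0.5, w2=1.0, w0=-0.25):
    """Step-function perceptron: 1 if the total input is positive, else 0."""
    total = w1 * y1 + w2 * y2 + w0
    return 1 if total > 0 else 0

# The four training examples from the slide: (y1, y2) -> target output
EXAMPLES = {(0, 0.5): 1, (1, 1): 1, (1, 0.5): 0, (0, 0): 0}
results = {point: perceptron(*point) for point in EXAMPLES}
```

Each example lies exactly 0.25 above or below the line, so `results` matches the target outputs.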



    Learning Algorithm

    • Backpropagation

      • Errors at the output layer are propagated back toward the input layer

      • Weights are adjusted

      • Based on gradient descent method



    NN Application

    • Protein structure prediction by PROF

      • Input layer

        • Sliding 15-residue window

        • Predict secondary structure of the central residue

        • One residue has 20 input nodes

      • Hidden layer

        • Connected to ALL input and output nodes
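
The input layer described above (a 15-residue window with 20 nodes per residue) amounts to a one-hot encoding of 15 × 20 = 300 values. The sketch below illustrates that encoding only and is not PROF's actual input code. The residue ordering in `AMINO_ACIDS` and the all-zero padding for positions beyond the sequence ends are assumptions.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # assumed ordering of the 20 input nodes

def encode_window(seq, center, window=15):
    """One-hot encode the window of residues centred on position `center`."""
    half = window // 2
    vector = []
    for pos in range(center - half, center + half + 1):
        one_hot = [0] * len(AMINO_ACIDS)
        if 0 <= pos < len(seq):
            one_hot[AMINO_ACIDS.index(seq[pos])] = 1
        # Positions past either end of the sequence stay all-zero (assumption).
        vector.extend(one_hot)
    return vector
```

Sliding `center` along the sequence yields one 300-value input vector per residue, and the network's output for that vector is the predicted structure of the central residue.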



    NN Application

    • Protein structure prediction by PHD (1993)

      • Based on 250 unique protein chains

        • Based on profile info

        • 20 AA + insertion + deletion + conservation weight

        • 13 columns are used as input

        • Connected to ALL input and output nodes



    NN Application

    • Intron prediction

      • Intron splice site spans 15-60 nt

        • Organisms have unique codon usages at donor sites

