Rna protein structures


RNA/Protein Structures


RNA structure

  • Stem-loop structure


RNA structure

  • A loop structure

    • A loop between i and j when base at i pairs with base at j

    • Base at i+1 pairs with base at j

    • Or base at i pairs with base at j-1

    • Or a multiple loop


RNA secondary structure

  • Search for minimum free energy

    • Gibbs free energy at 37 degrees (C)

    • Free energy increments of base pairs are counted as stacks of adjacent pairs

      • Successive CGs: -3.3 kcal/mol

      • Unfavorable loop initiation energy to constrain bases in a loop


RNA structure prediction

  • Ad-hoc approach

    • Simply look at a strand and find areas where base pairing can occur

    • Possible to find many locations where folds can occur

    • Prediction should be able to determine the most likely one

      • What should be the criteria ?

  • 1980, Nussinov-Jacobson Algorithm

    • More stable one is the most likely structure

    • Find the fold that forms the greatest number of base pairs (base-pairing lowers the overall energy of the strand, more stable)

    • Checking for all possible folds is impossible -> dynamic programming
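The dynamic-programming idea can be sketched as follows. This is a minimal illustration of a Nussinov-style recursion that only counts base pairs; the pairing rules (Watson-Crick plus the G-U wobble) and the `min_loop` parameter are assumptions made for the example, not details taken from the slides.

```python
def can_pair(a, b):
    """Watson-Crick pairs plus the G-U wobble pair (an assumption)."""
    return {a, b} in ({"A", "U"}, {"G", "C"}, {"G", "U"})


def nussinov_max_pairs(seq, min_loop=0):
    """Return the maximum number of non-crossing base pairs in seq.

    min_loop is a hypothetical parameter: bases i and j may pair only
    if j - i > min_loop.
    """
    n = len(seq)
    if n == 0:
        return 0
    # best[i][j] = maximum number of pairs within seq[i..j] (inclusive)
    best = [[0] * n for _ in range(n)]
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            score = max(best[i + 1][j], best[i][j - 1])  # i or j unpaired
            if span > min_loop and can_pair(seq[i], seq[j]):
                inner = best[i + 1][j - 1] if i + 1 <= j - 1 else 0
                score = max(score, inner + 1)            # i pairs with j
            for k in range(i + 1, j):                    # two sub-structures
                score = max(score, best[i][k] + best[k + 1][j])
            best[i][j] = score
    return best[0][n - 1]
```

The table is filled in order of increasing subsequence length, so every smaller problem is solved before it is needed; the run time grows as the cube of the sequence length instead of exponentially.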


Amino Acid

  • General structure of amino acids

    • an amino group

    • a carboxyl group

    • α-carbon bonded to a hydrogen and a side-chain group, R

  • Side chain R determines the identity of particular amino acid

  • R: large white and gray

  • C: black

  • Nitrogen: blue

  • Oxygen: red

  • Hydrogen: white


Protein

  • Protein: polymer consisting of AA’s linked by peptide bonds

    • AA in a polymer is called a residue

  • Folded into 3D structures

  • Structure of protein determines its function

    • Primary structure: linear arrangement of AA’s

      • AA sequence (primary structure) determines 3D structure of a protein, which in turn determines its properties

      • N- and C-terminal

    • Secondary structure: short stretches of AAs

    • Tertiary structure: overall 3D structure


Protein Structures


Secondary structure

  • Secondary structures have repetitive interactions resulting from hydrogen bonding between the N-H and C=O (carbonyl) groups of the peptide backbone

  • Conformations of side chains of AA are not part of the secondary structure

  • α-helix


  • β-pleated sheet

    • Parallel/antiparallel

    • 3D form of antiparallel


Secondary structure: domain

  • Part of chain folds independently of foldings of other parts

    • Such an independently folded portion of a protein is called a domain (super-secondary structure)

  • βαβ unit

  • αα unit (helix-turn-helix)

  • β-meander

  • Greek key


Domain

  • Larger proteins are modular

    • Their structural units, domains or folds, can be covalently linked to generate multi-domain proteins

    • Domains are not only structurally, but also functionally, discrete units – domain family members are structurally and functionally conserved and recombined in complex ways during evolution

    • Domains can be seen as the units of evolution

    • Novelty in protein function often arises as a result of gain or loss of domains, or by re-shuffling existing domains along sequence

    • For pairs of protein domains with the same 3D fold, precise function is conserved down to ~40% sequence identity (broad functional class is conserved down to ~20%)


Motif

  • A repetitive super-secondary structure is called a motif (or module)

    • The Greek key motif is often found in β-barrel tertiary structures

  • complement control protein module

  • Immunoglobulin module

  • Fibronectin type I module

  • Growth factor module

  • Kringle module


Motif Representation

  • Motif

    • In multiple alignments of distantly related sequences, highly conserved regions are called motifs, features, signatures or blocks

    • Tends to correspond to core structural and functional elements of the proteins


  • Linked series of β-meanders

  • Greek key pattern

  • Alternating βαβ units

  • Top and side views (α-helical section is outside)


Secondary structure: conformation

  • Two types of Protein Conformations

    • Fibrous

    • Globular – folds back onto itself to create a roughly spherical shape

  • Schematic diagrams of fibrous and globular proteins

  • Computer-generated model of globular protein


SRC protein

  • Tyrosine kinase

  • Enzyme that puts a phosphate group on a tyrosine residue (phosphorylation)

  • Activates an inactive protein, eventually activates cell-division proteins


Secondary Structure Prediction by PSIPRED

  • Prediction of regions of the protein that form alpha-helix, beta-sheet, or random coil

  • NP_005408

>gi|4885609|ref|NP_005408.1| proto-oncogene tyrosine-protein kinase Src [Homo sapiens]

MGSNKSKPKDASQRRRSLEPAENVHGAGGGAFPASQTPSKPASADGHRGPSAAFAPAAAEPKLFGGFNSS

DTVTSPQRAGPLAGGVTTFVALYDYESRTETDLSFKKGERLQIVNNTEGDWWLAHSLSTGQTGYIPSNYV

APSDSIQAEEWYFGKITRRESERLLLNAENPRGTFLVRESETTKGAYCLSVSDFDNAKGLNVKHYKIRKL

DSGGFYITSRTQFNSLQQLVAYYSKHADGLCHRLTTVCPTSKPQTQGLAKDAWEIPRESLRLEVKLGQGC

FGEVWMGTWNGTTRVAIKTLKPGTMSPEAFLQEAQVMKKLRHEKLVQLYAVVSEEPIYIVTEYMSKGSLL

DFLKGETGKYLRLPQLVDMAAQIASGMAYVERMNYVHRDLRAANILVGENLVCKVADFGLARLIEDNEYT

ARQGAKFPIKWTAPEAALYGRFTIKSDVWSFGILLTELTTKGRVPYPGMVNREVLDQVERGYRMPCPPEC

PESLHDLMCQCWRKEPEERPTFEYLQAFLEDYFTSTEPQYQPGENL


Examining Crystal Structure

  • Cn3D: NCBI structure viewer and modeling tool

  • DeepView (Swiss-PdbViewer): SWISS-PROT

  • JMOL

  • NCBI Structure database

    • Links to NCBI MMDB (Molecular Modeling Database)

    • MMDB contains experimentally verified protein structures

    • SRC – MMDB ID 56157, PDB ID 1FMK

    • View Structure from NCBI Structure database

      • Opens up Cn3D window

      • Click to rotate; Ctrl-click to zoom; Shift-click to move

      • Rendering and coloring menus


Tertiary structure

  • 3D arrangement of all atoms in the molecule

  • Considers arrangement of helical and sheet sections, conformations of side chains, arrangement of atoms of side chains, etc.

  • Experimentally determined by

    • X-ray crystallography – measure diffraction patterns of atoms

    • NMR (Nuclear Magnetic Resonance) spectroscopy – use protein samples in aqueous solution


  • Tertiary structure of α-lactalbumin and myoglobin


Protein families

  • Groups of genes of identical or similar sequence are common

    • Sometimes, repetition of identical sequences is correlated with the synthesis of increased quantities of a gene product

      • e.g., a genome contains multiple copies of ribosomal RNAs

      • Human chromosome 1 has 2000 genes for 5S rRNA (S is the sedimentation coefficient), and chromosomes 13, 14, 15, 21 and 22 have 280 copies of a repeat unit made up of the 28S, 5.8S and 18S rRNA genes

    • Amplification of rRNA genes evolved because of the heavy demand for rRNA synthesis during cell division

    • These rRNA genes are examples of gene families having identical or near-identical sequences

    • Sequence similarities indicate a common evolutionary origin

    • α- and β-globin families have distinct sequence similarities evolved from a single ancestral globin gene


Protein families and superfamilies

  • Dayhoff classification, 1978

    • Protein families – at least 50% AA sequence similarity (based on physico-chemical AA features)

    • Related proteins with less similarity (35%) belong to a superfamily, may have quite diverse functions

    • α- and β-globins are classified as two separate families, and together with myoglobins form the globin superfamily

    • These families share sequence similarities that evolved from a single ancestral globin gene


Protein family database

  • Pattern or secondary database derived from sequences

    • a pattern may be the most conserved aspects of sequence families

    • The most conserved part may vary between species

    • Use scoring system to account for some variability

    • Position-specific scoring matrix (PSSM) or Profile

      • In contrast to a pairwise alignment, which uses the same weights regardless of position

  • Protein family databases are derived by different analytical techniques

    • But, trying to find motifs, conserved regions, considered to reflect shared structural or functional characteristics

    • Three groups: single motifs, multiple motifs, or full domain alignments


Protein family databases

  • Pattern or secondary database derived from sequences


Single Motif Method

  • Regular expression

    • PROSITE

    • PDB 1ivy

      • Carboxypet_Ser_His (PS00560)

      • [LIVF]-x(2)-[LIVSTA]-x-[IVPST]-[GSDNQL]-[SAGV]-[SG]-H-x-[IVAQ]-P-x(3)-[PSA]

        • [] – any of the enclosed symbols

        • x – any residue

        • (3) – number of repeats

  • Fuzzy regular expression

    • Build regular expressions with info on shared biochemical properties of AA

    • Provide flexibility according to AA group clustering
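As a rough illustration of how a PROSITE-style pattern maps onto an ordinary regular expression, here is a small sketch. The helper `prosite_to_regex` is a hypothetical name, and it handles only the elements explained above (bracketed alternatives, `x`, and repeat counts) plus `{...}` exclusions; anchors and other PROSITE syntax are not covered. The pattern string used in the usage note is a cleaned-up reading of the Carboxypet_Ser_His pattern.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.split("-"):
        # Split off an optional repeat count such as x(3) or x(2,4).
        match = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = match.group(1), match.group(2)
        if core == "x":
            core = "."                          # x matches any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"      # {..}: any residue except these
        parts.append(core + ("{" + repeat + "}" if repeat else ""))
    return "".join(parts)
```

For example, `prosite_to_regex("[LIVF]-x(2)-[LIVSTA]-x-[IVPST]-[GSDNQL]-[SAGV]-[SG]-H-x-[IVAQ]-P-x(3)-[PSA]")` yields a regular expression that can be passed to `re.search` to scan a protein sequence for the motif.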


Multiple motif methods

  • PRINTS

    • Encode multiple motifs (called fingerprints) in ungapped, unweighted local alignments

  • BLOCKS

    • Derived from PROSITE and PRINTS

    • Use the most highly conserved regions in protein families in PROSITE

    • Use motif-finding algorithm to generate a large number of candidate blocks

    • Initially, three conserved AA positions anywhere in the alignment are identified and used as anchors

    • Blocks are iteratively extended and ultimately encoded as ungapped local alignments

    • Graph theory is used to assemble a best set of blocks for a given family

    • Use position specific scoring matrix (PSSM), similar to a profile


Full domain alignment

  • Profiles

    • Use family-based scoring matrix via dynamic programming

    • Has position-specific info on insertions and deletions in the sequence family

  • Hidden Markov Model (HMM)

    • PFAM, SMART, TIGRFAM represent full domain alignments as HMMs

    • PFAM

      • Represents each family as seed alignment, full alignment, and an HMM

      • Seed contains representative members of the family

      • Full alignment contains all members of the family as detected with an HMM constructed from the seed alignment


Hidden Markov Model (HMM)

  • Markov Process

    • Can be decomposed into successive discrete states

    • e.g., first-order Markov process – a traffic light

    • In a hidden Markov model, process states are not directly observable – e.g., spoken sounds vs. physical changes in the vocal cords, position of the tongue, etc.

  • Profile HMM

    • Discrete states correspond to successive columns of protein multiple sequence alignment

      • Match, insertion, deletion states

    • States have associated symbol emission probability distribution

    • Position-specific gap weight represents transition probability from indel to match


Protein Structure Comparison/Classification


Protein structures

  • Domain

    • Polypeptide chain in a protein folds into a ‘tertiary’ structure

    • One or more compact globular regions called domains

    • The tertiary structure associated with a domain region is also described as a protein fold

  • Multi-domain

    • Proteins with polypeptide chains fold into several domains

    • Nearly half of the known globular structures are multidomain, and more than half of these have two domains

  • Automatic structure comparison methods were introduced in the 1970s, shortly after the first crystal structures were stored in the PDB


Reasons for structural comparisons

  • Ligand binding

    • Binding of a ligand or a substrate to an active site in a protein induces a structural change which facilitates the reaction being catalyzed at the site or promotes binding of substrates at another site

    • Comparing the ligand-bound and unbound structures sheds light on these processes and informs drug design

  • Distant evolutionary relationship

    • Protein structure is more highly conserved than sequence

    • Structure comparison can detect homologs with substantial changes

  • Structural variations in protein families

  • Identification of common structural motifs


Structure comparison algorithms

  • Two main components in structure comparison algorithms

    • Scoring similarities in structural features

    • Optimization strategy maximizing similarities measured

  • Most are based on geometric properties from 3D coordinates

    • Intermolecular method

      • Superpose structures by minimizing the distance between superposed positions

    • Intramolecular method

      • Compare sets of internal distances between positions to identify an alignment maximizing the number of equivalent positions

      • Distance is described by RMSD (root mean square deviation), the square root of the average squared distance between equivalent atoms: $\mathrm{RMSD} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} d_i^2}$
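The RMSD definition can be written out directly. The sketch below assumes the two structures are already superposed and that the coordinate lists are given in matching order, so that the i-th atoms are the equivalent pair; the function name `rmsd` and the tuple-based coordinate format are choices made for this example.

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean square deviation between equivalent atoms.

    coords_a and coords_b are equal-length lists of (x, y, z) tuples,
    assumed to be already superposed.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))
```

A full comparison method would also search for the rotation and translation that minimize this value; that optimization step is not shown here.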


Inter vs. Intra


RMSD


Distant homolog

  • Structure is more conserved than sequences during evolution

  • Structural similarity between distant homologs can be found

    • Pairwise sequence similarity

    • SSAP structural similarity score in parenthesis (0 – 100)


Distant homolog


Structural variations in protein families


Structure comparison algorithms

  • SSAP, 1989

    • Residue level, Intra, Dynamic programming

  • DALI, 1993

    • Residue fragment level, intra, Monte Carlo optimization

  • COMPARER, 1990

    • Multiple element level, both, Dynamic programming


Structure classification

  • Most structure classifications are established at the domain level

    • Thought to be an important evolutionary unit and easier to determine domain boundaries from structural data than from sequence data

  • Criteria for assessing domain regions within a structure

    • The domain possesses a compact globular structure

    • Residues within a domain make more contacts with each other than with residues in the rest of the polypeptide

    • Secondary structure elements are usually not shared with other regions of the polypeptide

    • There is evidence for existence of this region as an evolutionary unit


CATH classifications


Multi-domain structures


Structure classification hierarchy

  • Class level -- proteins are grouped according to their structural class (composition of residues in α-helical and β-strand conformations)

    • Mainly-α, mainly-β, alternating α-β, α plus β (mainly-α and mainly-β regions are segregated)

  • Architecture

    • the manner by which secondary structure elements are packed together (arrangement of sec. structures in 3D space)

  • Fold group (topology)

    • Orientation of sec. structures and the connectivity between them

  • Superfamily

  • Family


Hierarchy example


Protein Structure databases

  • PDB

    • Over 20,000 entries deduced from X-ray diffraction, NMR or modeling

    • Massively redundant

      • 1FMK, 1BK5, 2F9C, ..

  • SCOP (Structural Classification of Proteins)

    • Multi-domain protein is split into its constituent domains

    • Known structures are classified according to evolutionary and structural relationship

    • Domains in SCOP are grouped by species and hierarchically classified into families, superfamilies, folds and classes

      • Family level – groups together domains with clear sequence similarities

      • Superfamily – group of domains with structural and functional evidence for their descent from a common evolutionary ancestor

      • Fold – group of domains with the same major secondary structures and the same chain topology

    • Domains identified manually by visually inspecting structures


Protein Structure databases

  • SCOP (cont’d)

    • Proteins in the same superfamily often have the same function

  • CATH (Class, Architecture, Topology, Homology)

    • Homology – clustered domains with 35% sequence identity and shared common ancestry

    • 800 fold families, 10 of which are super-folds

    • 2009 www.cs.uml.edu/~kim/580/08_cath.pdf


Protein Function/Structure Prediction


Protein Function Prediction

  • In the absence of experimental data, the function of a protein is usually inferred from its sequence similarity to a protein of known function

    • The more similar the sequence, the more similar the function is likely to be

    • Not always true

  • Can clues to function be derived directly from 3D structure?

  • Definition of function

    • Function can be described at many levels: biochemical, biological processes, pathways, organ level

    • Proteins are annotated at different degrees of functional specificity: ubiquitin-like domain, signaling protein, ..

    • GO (Gene Ontology) scheme


Protein Function Prediction

  • Sequence-based – largely unreliable

  • Profile-based

    • Profiles are constructed from sequences of whole protein families, where families are grouped by 3D structure or function (as in Pfam)

    • Start with sequences matched by an initial search, iteratively pull in more remote homologues

    • More sensitivity than simple sequence comparison because profiles implicitly contain information on which residues within the family are well conserved and which sites are more variable

  • Structure-based

    • Fold-based

      • Proteins sharing similar functions often have similar folds, resulting from descent from a common ancestral protein

      • Sometimes, the function of a protein changes during evolution while the fold remains unchanged

      • Thus, fold match is not always reliable

    • Surface clefts and binding pockets


Structure-based Sequence Alignment

  • It is well known that alignments based on sequence similarity alone are not always correct, and that proteins can have similar structures with no detectable sequence similarity

  • Sequence alignment is augmented by structural alignments

    • COMPASS, HOMSTRAD, PALI, ..


Structure Prediction

  • Still an open problem

  • 1974 Peter Chou and Gerald Fasman

    • Propensity values : likelihood that an AA appears in a structure

      • P(a), P(b) and P(turn)

      • >1 indicates a greater than average chance

    • Frequency values: frequency of an AA being found in a hairpin

      • Four positions in a hairpin turn

    • Accuracy is around 50-60%, but popular due to its foundation for later prediction programs


AA              P(a)  P(b)  P(turn)  f(i)   f(i+1)  f(i+2)  f(i+3)
Alanine         142    83     66     0.060  0.076   0.035   0.058
Arginine         98    93     95     0.070  0.106   0.099   0.085
Asparagine       67    89     95     0.161  0.083   0.191   0.091
Aspartic acid   101    54    146     0.147  0.110   0.179   0.081
Cysteine         70   119    119     0.149  0.050   0.117   0.128
Glutamic acid   151    37     74     0.056  0.060   0.077   0.064
Glutamine       111   110     98     0.074  0.098   0.037   0.098
Glycine          57    75    156     0.102  0.085   0.190   0.152
Histidine       100    87     95     0.140  0.047   0.093   0.054
Isoleucine      108   160     47     0.043  0.034   0.013   0.054
Leucine         121   130     59     0.061  0.025   0.036   0.070
Lysine          114    74    101     0.055  0.115   0.072   0.095
Methionine      145   105     60     0.068  0.082   0.014   0.055
Phenylalanine   113   138     60     0.059  0.041   0.065   0.065
Proline          57    55    152     0.102  0.301   0.034   0.068
Serine           77    75    143     0.120  0.139   0.125   0.106
Threonine        83   119     96     0.086  0.108   0.065   0.079
Tryptophan      108   137     96     0.077  0.013   0.064   0.167
Tyrosine         69   147    114     0.082  0.065   0.114   0.125
Valine          104   170     50     0.062  0.048   0.028   0.053


Chou-Fasman Algorithm

  • Step 1: alpha-helices

    • Find a region of six contiguous residues where at least four have P(a) > 103

    • Extend the region until a set of four contiguous residues with P(a) < 100 is found

    • If the region’s average P(a) > 103, its length is > 5, and ∑P(a) > ∑P(b), the region is predicted to be an alpha helix

  • Step 2: beta strands

    • Find a region of five contiguous residues where at least three have P(b) > 105

    • Extend the region until a set of four contiguous residues with P(b) < 100 is found

    • If the region’s average P(b) > 105 and ∑P(b) > ∑P(a), the region is predicted to be a beta strand
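The first bullet of Step 1 (helix nucleation) can be sketched in a few lines. The `P_ALPHA` dictionary below copies the P(a) column of the propensity table, keyed by one-letter amino acid codes; the function name and its parameters are illustrative choices, and the extension and averaging steps are deliberately left out.

```python
# P(a) values from the propensity table, keyed by one-letter code.
P_ALPHA = {
    "A": 142, "R": 98, "N": 67, "D": 101, "C": 70, "E": 151, "Q": 111,
    "G": 57, "H": 100, "I": 108, "L": 121, "K": 114, "M": 145, "F": 113,
    "P": 57, "S": 77, "T": 83, "W": 108, "Y": 69, "V": 104,
}


def helix_nuclei(seq, window=6, min_strong=4, threshold=103):
    """Return start indices of windows that could nucleate an alpha helix.

    A window qualifies when at least min_strong of its residues have
    P(a) above the threshold.
    """
    starts = []
    for start in range(len(seq) - window + 1):
        strong = sum(1 for aa in seq[start:start + window]
                     if P_ALPHA[aa] > threshold)
        if strong >= min_strong:
            starts.append(start)
    return starts
```

The beta-strand nucleation rule has the same shape: swapping in the P(b) column with `window=5`, `min_strong=3` and `threshold=105` would give the corresponding scan.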


Chou-Fasman Algorithm

  • Step 3: beta turns

    • For each residue position j, determine the turn propensity P(t)_j as

    • $P(t)_j = f(i)_j \times f(i+1)_{j+1} \times f(i+2)_{j+2} \times f(i+3)_{j+3}$

    • A turn is predicted at position j if P(t)_j > 0.000075, the average P(turn) from j to j+3 is > 100, and ∑P(a) < ∑P(turn) > ∑P(b)

  • Step 4: overlaps

    • If alpha region overlaps with beta, the region’s ∑P(a) and ∑P(b) determine the most likely structure in the overlapped region

    • If ∑P(a) > ∑P(b) for the overlapping region, alpha

    • If ∑P(a) < ∑P(b) for the overlapping region, beta

    • If ∑P(a) = ∑P(b), no valid call


Secondary structure prediction

  • Chou and Fasman (1974) based on the frequencies of amino acids found in α-helices, β-sheets, and turns.

  • Proline: occurs at turns, but not in α-helices.

  • GOR (Garnier, Osguthorpe, Robson): related algorithm

  • Modern algorithms: use multiple sequence alignments and achieve higher success rate (about 70-75%)

Page 427


Secondary structure prediction

Web servers:

GOR4

Jpred

NNPREDICT

PHD

Predator

PredictProtein

PSIPRED

SAM-T99sec

Table 11-3

Page 429


Markov Model (MM)

  • Examine correlation in sequences

  • In a long sequence, suppose AA $a$ is observed $n_a$ times and that $a$ is followed by $b$ $n_{ab}$ times

    • $r_{ab} = P(x_i = b \mid x_{i-1} = a) = n_{ab} / n_a$

    • A first-order Markov model of a sequence is defined by

      • An alphabet

      • A matrix of conditional probabilities $r_{ab}$

      • A set of frequencies for the initial state, $q_a$

    • The likelihood of a sequence $x_1, x_2, \ldots, x_N$ according to the first-order model is

      • $L = q_{x_1} \prod_{i=2}^{N} r_{x_{i-1} x_i}$

    • If there is no correlation, $r_{ab} = q_b$ and $L = \prod_{i=1}^{N} q_{x_i}$ (zero-order MM)

    • kth-order MM

      • For $k = 2$: $r_{abc} = P(x_i = c \mid x_{i-1} = b, x_{i-2} = a) = n_{abc} / n_{ab}$

  • A letter is dependent upon preceding letters
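A first-order model of this kind can be estimated from counts in a few lines. In the sketch below, `fit_first_order` and `likelihood` are hypothetical helper names. One small departure from the slide is flagged in the comments: the denominator counts only occurrences of a letter that are followed by another letter, so that each row of conditional probabilities sums to one.

```python
from collections import Counter


def fit_first_order(seq):
    """Return (q, r): letter frequencies q[a] and conditional probs r[a][b]."""
    q = {a: count / len(seq) for a, count in Counter(seq).items()}
    pair_counts = Counter(zip(seq, seq[1:]))   # n_ab
    # n_a restricted to positions that have a successor (an assumption,
    # so that sum over b of r[a][b] equals 1).
    first_counts = Counter(seq[:-1])
    r = {}
    for (a, b), n_ab in pair_counts.items():
        r.setdefault(a, {})[b] = n_ab / first_counts[a]
    return q, r


def likelihood(seq, q, r):
    """Likelihood of seq under the first-order model (q, r)."""
    value = q.get(seq[0], 0.0)
    for a, b in zip(seq, seq[1:]):
        value *= r.get(a, {}).get(b, 0.0)      # unseen transitions score 0
    return value
```

For real sequences the product underflows quickly, so a practical version would sum logarithms instead of multiplying probabilities.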


Fair Bet Casino Problem

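  • The dealer uses a fair coin, but occasionally switches to a biased coin

  • Given a sequence of coin tosses, determine when the dealer used the fair coin and when the biased coin

  • For n tosses with sequence $x = x_1 x_2 \ldots x_n$, of which $k$ are heads

    • $P(x \mid \text{fair}) = \prod_{i=1}^{n} \frac{1}{2} = (1/2)^n$

    • $P(x \mid \text{biased}) = q^k (1-q)^{n-k}$, where $q$ is the probability of heads for the biased coin

    • Log-odds ratio

      • $R = \log_2 \frac{P(x \mid \text{fair})}{P(x \mid \text{biased})}$

    • For $q = 3/4$, $R = n - k \log_2 3$

      • If $R < 0$, the biased coin is more likely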
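The log-odds ratio is easy to check numerically. The sketch below assumes tosses are given as a string of `"1"` (heads) and `"0"` (tails) and uses base-2 logarithms; the function name is made up for the example.

```python
import math


def log_odds_fair_vs_biased(tosses, q=0.75):
    """Return log2 of P(x | fair) / P(x | biased) for a string of 0/1 tosses.

    q is the probability that the biased coin shows heads ("1").
    """
    n = len(tosses)
    k = tosses.count("1")                      # number of heads
    log_fair = n * math.log2(0.5)
    log_biased = k * math.log2(q) + (n - k) * math.log2(1 - q)
    return log_fair - log_biased
```

With the default `q` of 3/4 the result agrees with n minus k times log2(3): a run of heads gives a negative value (biased coin more likely), and a run of tails gives a positive one.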


CG Islands

  • CG is the least frequent di-mer sequence

  • The C in CG is easily methylated, and methyl-C tends to mutate to T

  • Methylation is often suppressed around genes in CG islands

  • Find CG islands in long DNA sequences

    • Calculate log-odds ratios of a sliding window of a certain length

    • And declare a CG island if the score is positive

  • Disadvantage of the approach

    • Do not have info of the CG island length in advance

    • => use HMM


Hidden Markov Model (HMM)

  • An abstract machine emitting symbols

  • At each discrete step, the HMM makes two decisions

    • What is the next state

    • What symbol to emit

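  • [Figure: two-state HMM with a fair state F and a biased state B; each state switches to the other with probability 0.1, and each emits H or T]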


Coin Toss

  • Given a path P= FFFBBBBBFFF

  • And output x=01011101001

  • P(x|P) = ½ ½ ½ ¾ ¾ ¾ ¼ ¾ ½ ½ ½

  • P(P) = ½ 9/10 9/10 1/10 9/10 9/10 9/10 ….

  • Find a path that maximizes P(x|P) over all possible P
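The two products for the example path can be verified with exact fractions. The sketch assumes that symbol `1` is the outcome the biased coin favours (probability 3/4), that each state is kept with probability 9/10, and that the two start states are equally likely, which is how the numbers above read; the constant and function names are illustrative.

```python
from fractions import Fraction

# Emission probabilities for the fair (F) and biased (B) coin.
EMIT = {
    "F": {"0": Fraction(1, 2), "1": Fraction(1, 2)},
    "B": {"0": Fraction(1, 4), "1": Fraction(3, 4)},
}
START = Fraction(1, 2)      # probability of starting in either state
STAY = Fraction(9, 10)      # probability of keeping the same coin
SWITCH = Fraction(1, 10)    # probability of changing coins


def emission_prob(x, path):
    """P(x | path): product of the emission probabilities along the path."""
    prob = Fraction(1)
    for symbol, state in zip(x, path):
        prob *= EMIT[state][symbol]
    return prob


def path_prob(path):
    """P(path): start probability times the transition probabilities."""
    prob = START
    for prev, cur in zip(path, path[1:]):
        prob *= STAY if prev == cur else SWITCH
    return prob
```

Multiplying `emission_prob(x, path)` by `path_prob(path)` gives the joint probability of the tosses and the path, which is the quantity a decoding algorithm compares across paths.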


HMM of Loop/Helical

  • A letter depends on preceding letters AND on hidden state

    • In helical/loop problem, two hidden states: 0 for loop, 1 for helical

    • rB0, rB1: probabilities that the first residue is loop or helical

    • r11, r10: probabilities of remaining in helical, or switching to loop

    • etc.


Hidden Markov Model (HMM)

  • Emission probabilities are set equal to the AA frequencies in each state

    • $e_0(a) = p^{l}_a$ (loop), $e_1(a) = p^{h}_a$ (helical)

    • AAs occur independently as long as the model stays in state 0 or state 1 (zero-order)

    • Transitions between hidden states are modeled by a first-order process

      • Values of transition probs. (r11, r10 ,…) control the relative frequency and relative lengths of the regions

      • If r01 is very small, difficult to initiate a new helical region, etc.

    • e.g.

    • Sequence $x_i$:     G H M E S S L L K Q T I N S W H L N

    • Path $\pi_i$:     B 0 0 1 1 1 1 0 0 0 0 1 1 1 1 0 0 0 0 E


Hidden Markov Model (HMM)

  • Path variables $\pi_i$ describe the hidden states

  • Likelihood of the path $\pi$ in the example,

    • $L = [r_{B0}\, e_0(G)]\,[r_{00}\, e_0(H)]\,[r_{01}\, e_1(M)] \cdots [r_{00}\, e_0(N)]\, r_{0E}$

  • Model can be used to determine the most likely positions of helices and loops within the sequence (called decoding problem)

    • Two ways of doing this

      • Viterbi

        • Find the most probable path through the model, i.e., find the sequence of hidden states with the highest L

        • This gives a straightforward prediction that each site is either helix or loop

      • Forward/Backward

        • Consider all possible paths through the model, weighted according to their likelihood, and calculate the probability that each site is in each of the hidden states


HMM Parameters

  • The most probable paths depend on the emission and transition probabilities

  • Parameters are determined from known structure information, from which we can calculate maximum-likelihood (ML) values of the probabilities

  • As in the profile model, ML values of emission frequencies are given by the observed frequencies

    • The simplest way of choosing $p_a$ is to use $n_a / n_{tot}$ (this indeed maximizes $L$)

  • If AA $a$ occurs $n_{ka}$ times in regions of state $k$, and the total number of residues in state $k$ is $n_k^{tot}$,

    • $e_k(a) = n_{ka} / n_k^{tot}$

  • If state $j$ follows state $k$ $m_{kj}$ times,

    • $r_{kj} = m_{kj} / n_k^{tot}$ (1st-order model)

  • If some transitions occur very rarely, or not at all, it is best to use prior info in choosing the frequencies (by adding pseudo-counts)
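The counting estimates, with pseudo-counts added, might look like the sketch below. It assumes labelled training data given as `(sequence, path)` string pairs with one state character per residue, ignores the begin and end states, and normalises each transition row by the number of transitions leaving the state; `estimate_hmm` and its parameters are hypothetical names.

```python
def estimate_hmm(examples, states, alphabet, pseudocount=1.0):
    """Estimate emission and transition probabilities from labelled paths.

    examples is a list of (sequence, path) pairs of equal-length strings.
    The pseudocount is added to every count before normalising.
    """
    emit_counts = {k: {a: pseudocount for a in alphabet} for k in states}
    trans_counts = {k: {j: pseudocount for j in states} for k in states}
    for seq, path in examples:
        for a, k in zip(seq, path):
            emit_counts[k][a] += 1             # n_ka
        for k, j in zip(path, path[1:]):
            trans_counts[k][j] += 1            # m_kj

    def normalise(row):
        total = sum(row.values())
        return {key: value / total for key, value in row.items()}

    emissions = {k: normalise(row) for k, row in emit_counts.items()}
    transitions = {k: normalise(row) for k, row in trans_counts.items()}
    return emissions, transitions
```

With `pseudocount=0.0` this reduces to the plain observed frequencies, but a state that never appears in the training data would then cause a division by zero, which is one practical reason for keeping the pseudo-counts.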


    HMM Parameters

    • If parameters are chosen from a set of known examples, this is referred to as supervised learning

    • Unsupervised learning maximizes the likelihood within the framework of the model, but without being told what parameters to learn

      • We can define a model with two hidden states, defined by two sets of frequencies, without specifying the meaning of the two states

      • The learning process will then determine the best way of partitioning the sequence into two different types of subsequences

      • The simplest way of implementing HMM with unsupervised learning is Viterbi training


    Viterbi Algorithm

    • Start with an initial guess as to the model parameters

    • Calculate the values of hidden states on the most probable path for each sequence in the training set

    • From this, calculate nka and mkj in the most probable path


    Viterbi Algorithm

    • Given sequence $x_1, \ldots, x_N$,

      • $v_k(i)$: likelihood of the most probable path for the first $i$ letters in the sequence, given that the $i$-th letter is in state $k$

    • Initialize $v_k(1) = r_{Bk}\, e_k(x_1)$

    • And $v_k(i) = e_k(x_i) \max_h [v_h(i-1)\, r_{hk}]$ for $i = 2, \ldots, N$

    • $v_E = \max_h [v_h(N)\, r_{hE}]$: likelihood of the best total path

    • This is a dynamic programming algorithm.
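A compact sketch of the recursion, with traceback, is given below. It works with plain probabilities for readability, so it is only suitable for short sequences; states are assumed to be single characters, and the end-transition probabilities default to 1 when not supplied. The fair/biased coin parameters are an example model, not part of the algorithm.

```python
def viterbi(x, states, start, trans, emit, end=None):
    """Return (best_path, likelihood) for observation string x.

    start[k] = r_Bk, trans[h][k] = r_hk, emit[k][symbol] = e_k(symbol),
    end[k] = r_kE (assumed to be 1.0 for every state if omitted).
    """
    if end is None:
        end = {k: 1.0 for k in states}
    v = [{k: start[k] * emit[k][x[0]] for k in states}]   # v_k(1)
    back = []                                             # traceback pointers
    for symbol in x[1:]:
        scores, pointers = {}, {}
        for k in states:
            best_prev = max(states, key=lambda h: v[-1][h] * trans[h][k])
            scores[k] = v[-1][best_prev] * trans[best_prev][k] * emit[k][symbol]
            pointers[k] = best_prev
        v.append(scores)
        back.append(pointers)
    last = max(states, key=lambda h: v[-1][h] * end[h])
    best_likelihood = v[-1][last] * end[last]
    path = [last]
    for pointers in reversed(back):                       # follow pointers back
        path.append(pointers[path[-1]])
    return "".join(reversed(path)), best_likelihood


# Example model: fair (F) and biased (B) coin, symbols "0" and "1".
COIN_STATES = "FB"
COIN_START = {"F": 0.5, "B": 0.5}
COIN_TRANS = {"F": {"F": 0.9, "B": 0.1}, "B": {"F": 0.1, "B": 0.9}}
COIN_EMIT = {"F": {"0": 0.5, "1": 0.5}, "B": {"0": 0.25, "1": 0.75}}
```

Calling `viterbi("1111111111", COIN_STATES, COIN_START, COIN_TRANS, COIN_EMIT)` returns a path that stays in the biased state throughout. For long sequences the products underflow, so implementations usually add log-probabilities instead.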


    Baum-Welch Algorithm

    • Expectation-maximization, forward/backward

    • Expectation: calculate the probability $P(\pi_i = k)$ that site $i$ is in state $k$

      • Then the expected number of times letter $a$ appears in state $k$, averaged over all possible paths, is $E[n_{ka}] = \sum_{i:\, x_i = a} P(\pi_i = k)$

      • Also, $E[m_{kj}] = \sum_i P(\pi_i = k, \pi_{i+1} = j)$

    • Maximization: substitute $E[n_{ka}]$ and $E[m_{kj}]$ into $e_k(a) = n_{ka} / n_k^{tot}$ and $r_{kj} = m_{kj} / n_k^{tot}$

    • Expectation and maximization are repeated until the parameters no longer change

    • How is $P(\pi_i = k)$ calculated?


    Baum-Welch Algorithm

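    • How is $P(\pi_i = k)$ calculated?

      • Forward

        • $f_k(1) = r_{Bk}\, e_k(x_1)$

        • $f_k(i) = e_k(x_i) \sum_h f_h(i-1)\, r_{hk}$ ($i = 2, \ldots, N$)

        • $L_{tot} = \sum_h f_h(N)\, r_{hE}$

      • Backward: $b_k(i)$ – sum of likelihoods of all paths from $x_{i+1}$ to $x_N$, given that site $i$ is in state $k$

        • $b_k(N) = r_{kE}$

        • $b_k(i) = \sum_h r_{kh}\, e_h(x_{i+1})\, b_h(i+1)$ ($i = N-1, \ldots, 1$)

        • $L_{tot} = \sum_h r_{Bh}\, e_h(x_1)\, b_h(1)$

      • Both values of $L_{tot}$ must be identical

      • $P(\pi_i = k) = f_k(i)\, b_k(i) / L_{tot}$

      • $P(\pi_i = k, \pi_{i+1} = j) = f_k(i)\, r_{kj}\, e_j(x_{i+1})\, b_j(i+1) / L_{tot}$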
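The forward and backward recursions can be sketched as below, again with plain probabilities and therefore only for short sequences. The argument names mirror the symbols used above (`start` for the begin transitions, `trans` for state-to-state transitions, `emit` for emissions, `end` for the end transitions); the function names are illustrative.

```python
def forward(x, states, start, trans, emit, end):
    """Return (f, L_tot) where f[i][k] is the forward value at site i."""
    f = [{k: start[k] * emit[k][x[0]] for k in states}]
    for symbol in x[1:]:
        prev = f[-1]
        f.append({k: emit[k][symbol] * sum(prev[h] * trans[h][k] for h in states)
                  for k in states})
    total = sum(f[-1][h] * end[h] for h in states)
    return f, total


def backward(x, states, start, trans, emit, end):
    """Return (b, L_tot) where b[i][k] is the backward value at site i."""
    b = [dict(end)]                            # last site: end transitions
    for symbol in reversed(x[1:]):             # symbol is x at the later site
        nxt = b[0]
        b.insert(0, {k: sum(trans[k][h] * emit[h][symbol] * nxt[h] for h in states)
                     for k in states})
    total = sum(start[h] * emit[h][x[0]] * b[0][h] for h in states)
    return b, total


def posteriors(x, states, start, trans, emit, end):
    """Return, for each site, the probability of being in each state."""
    f, total = forward(x, states, start, trans, emit, end)
    b, _ = backward(x, states, start, trans, emit, end)
    return [{k: f[i][k] * b[i][k] / total for k in states} for i in range(len(x))]
```

Checking that `forward` and `backward` return the same total likelihood is a cheap sanity test for an implementation, and the posterior probabilities at every site should sum to one.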


    Helical/Loop Example

    • M1-M0 model

      • 1st-order transition between hidden states, 0-order independent letters

      • Also, ek(a) is dependent on hidden state k, but not on previous letter

      • Example of M1-M0 model

        • Occasionally dishonest casino, Durbin et al (1998)


    Helical/Loop Example

    • HMM model

      • Krogh et al. (2001)


    Coiled Coil Example

    • Coiled coils are associations of two or more α helices that wrap around each other

      • Found in many proteins, tropomyosin, hemagglutinin (influenza virus), DNA-binding transcription factors

      • About 3.5 residues per turn, leading to a repeating pattern of seven residues (a heptad) in two turns

      • Two helices attract one another due to hydrophobic residues at sites a and d


    Coiled Coil Example

    • Lupas, Van Dyke, and Stock (1991)

      • Developed a profile scoring system dependent on the relative amino acid frequencies at each site (similar to $L_{profile} = \prod_{i=1}^{N} p_i(x_i)$ and $S = \ln(L_{profile} / L_0) = \sum_{i=1}^{N} \ln(p_i(x_i) / p(x_i))$ in the profile model)

      • Used a sliding window of 28 residues (four heptads)

    • Delorenzi and Speed (2002)

      • HMM with 9 groups of states and Beg/End

      • Each of 9 groups contains seven states representing seven possible positions in the helix

        • States in one group are linked to the state at the following helix position in the next group


    Profile HMM

    • Profile technique

      • position-specific scores are used to describe aligned families of protein sequences

      • Drawback is the reliance on ad hoc scoring schemes

    • Profile HMMs were developed to capture the information in an alignment

    • 2 3

    • W H . . E n

    • W H . . Y .

    • W - . . E .

    • S H . . E .

    • T H e . Y .

    • W H e r E .


    Neural Networks

    • Loosely modeled on the human nervous system

      • Neurons and synapses

      • Neuron puts out a real number between 0 and 1

    • Feedforward network

    • Typically 10-20 residues are input

    • Usually used in supervised learning


    Single Neuron

    • Connection from input to a neuron has positive/negative weight wij

      • Total input $x_j = \sum_i w_{ij}\, y_i$

      • Output $y_j = g(x_j)$

      • A sigmoid function: $g(x_j) = 1 / [1 + \exp(-x_j)]$

    • Single output with multiple inputs is called a perceptron
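A single sigmoid neuron is only a few lines of code. In the sketch below the optional `bias` argument is an addition for convenience (the weighted sum above has no bias term), and the function names are illustrative.

```python
import math


def sigmoid(x):
    """Logistic function g(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def neuron_output(inputs, weights, bias=0.0):
    """Weighted sum of the inputs passed through the sigmoid."""
    total = sum(w * y for w, y in zip(weights, inputs)) + bias
    return sigmoid(total)
```

The output always lies strictly between 0 and 1, and equals 0.5 when the total input is zero.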


    Perceptron Example

    • Two inputs, one output

    • Trained by

      • Total input xj = w1y1 + w2y2 + w0 (w0 is bias)

      • Assume a step function for g(xj)

    (y1, y2) → y     constraint on the weights

    (0, ½)  → 1      w2/2 + w0 > 0

    (1, 1)  → 1      w1 + w2 + w0 > 0

    (1, ½)  → 0      w1 + w2/2 + w0 < 0

    (0, 0)  → 0      w0 < 0


    Perceptron Example

    • Visualize

      • Can pick the line y2 = ¼ + ½ y1 as the decision boundary

        • -¼ - ½ y1 + y2 > 0

      • w0 = -¼, w1 = -½, w2 = 1

    w2/2 + w0 > 0

    w1+ w2+ w0 > 0

    w1 + w2/2 + w0 < 0

    w0 < 0
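The four constraints can be checked mechanically. Reading the coefficients off the inequality -1/4 - 1/2·y1 + y2 > 0 gives w0 = -1/4, w1 = -1/2 and w2 = 1; the sketch below plugs these into a step-function perceptron and compares the output with the four training examples. The names `step`, `perceptron` and `TRAINING` are chosen for this illustration.

```python
def step(x):
    """Step activation: 1 if the total input is positive, else 0."""
    return 1 if x > 0 else 0


def perceptron(y1, y2, w1=-0.5, w2=1.0, w0=-0.25):
    """Two-input perceptron with bias w0 and a step activation."""
    return step(w1 * y1 + w2 * y2 + w0)


# Training examples from the slide: ((y1, y2), target output).
TRAINING = [((0, 0.5), 1), ((1, 1), 1), ((1, 0.5), 0), ((0, 0), 0)]
```

Any weights satisfying the four inequalities would work equally well; this particular choice is just one separating line.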


    Learning Algorithm

    • Backpropagation

      • Error at the output layer percolates back toward the input layer

      • Weights are adjusted

      • Based on gradient descent method


    NN Application

    • Protein structure prediction by PROF

      • Input layer

        • Sliding 15-residue window

        • Predict secondary structure of the central residue

        • One residue has 20 input nodes

      • Hidden layer

        • Connected to ALL input and output nodes
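The input layer described here amounts to a one-hot encoding of a sliding window: 15 residues times 20 input nodes gives 300 inputs. The sketch below is only an illustration of that encoding, not PROF's actual input format; in particular, the ordering of `AMINO_ACIDS` and the choice to encode positions beyond the sequence ends as all zeros are assumptions.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # 20 standard residues (assumed order)


def encode_window(seq, center, width=15):
    """One-hot encode the window of `width` residues centred on `center`.

    Returns a flat list of width * 20 values. Positions that fall outside
    the sequence are encoded as all zeros (an assumption).
    """
    half = width // 2
    vector = []
    for pos in range(center - half, center + half + 1):
        one_hot = [0] * len(AMINO_ACIDS)
        if 0 <= pos < len(seq):
            one_hot[AMINO_ACIDS.index(seq[pos])] = 1
        vector.extend(one_hot)
    return vector
```

Sliding `center` along the sequence produces one input vector per residue, and the network's output for each vector is read as the predicted secondary structure of that central residue. Letters outside the 20 standard residues raise a `ValueError` in this sketch.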


    NN Application

    • Protein structure prediction by PHD (1993)

      • Based on 250 unique protein chains

        • Based on profile info

        • 20 AA + insertion + deletion + conservation weight

        • 13 columns are used as input

        • Connected to ALL input and output nodes


    NN Application

    • Intron prediction

      • Intron splice site spans 15-60 nt

        • Organisms have unique codon usages at donor sites

