
Neural Networks for Protein Structure Prediction

Brown, JMB 1999

CS 466

Saurabh Sinha


Outline

  • Goal is to predict “secondary structure” of a protein from its sequence

  • Artificial Neural Network used for this task

  • Evaluation of prediction accuracy



http://academic.brooklyn.cuny.edu/biology/bio4fv/page/3d_prot.htm


http://matcmadison.edu/biotech/resources/proteins/labManual/images/220_04_114.png


Protein Structure

  • An amino acid sequence “folds” into a complex 3-D structure

  • Finding out this 3-D structure is a crucial and challenging task

  • Experimental methods (e.g., X-ray crystallography) are very tedious

  • Computational predictions are a possibility, but very difficult


What is “secondary structure”?


Figure: “Strand” and “Helix” (source: http://www.wiley.com/college/pratt/0471393878/student/structure/secondary_structure/secondary_structure.gif)

Figure: “Helix” and “Strand” (source: http://www.npaci.edu/features/00/Mar/protein.jpg)


Secondary structure prediction

  • Well, the whole 3-D “tertiary” protein structure may be hard to predict from sequence

  • But can we at least predict the secondary structural elements such as “strand”, “helix” or “coil”?

  • This is what this paper does

  • ... and so do many other papers (it is a hard problem!)


A survey of structure prediction

  • The most reliable technique is “comparative modeling”

    • Find a protein P whose amino acid sequence is very similar to your “target” protein T

    • Hope that this other protein P does have a known structure

    • Predict a structure similar to that of P, after carefully considering how the sequences of P and T differ


A survey of structure prediction

  • Comparative modeling fails if we don’t have a suitable homologous “template” protein P for our protein T

  • “Ab initio” tertiary methods attempt to predict the structure without using a known template structure

    • Incorporate basic physical and chemical principles into the structure calculation

    • Gets very hairy, and highly computationally intensive

  • The other option is prediction of secondary structure only (i.e., making the goal more modest)

    • These may be used to provide constraints for tertiary structure prediction


Secondary structure prediction

  • Early methods were based on stereochemical principles

  • Later methods realized that we can do better if we use not only the one sequence T (our sequence), but also a family of “related sequences”

  • Search for sequences similar to T, build a multiple alignment of these, and predict secondary structure from the multiple alignment of sequences


What’s multiple alignment doing here?

  • Most conserved regions of a protein sequence are either functionally important or buried in the protein “core”

  • More variable regions are usually on the surface of the protein

    • there are few constraints on what type of amino acids have to be here (apart from bias towards hydrophilic residues)

  • Multiple alignment tells us which portions are conserved and which are not
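One simple way to see which columns of a multiple alignment are conserved is to measure, for each column, how often the most common residue occurs. The sketch below is only an illustration of that idea; the function name `column_conservation`, the gap character "-", and the toy alignment are assumptions, not part of the paper.

```python
from collections import Counter

def column_conservation(alignment):
    """Per column, the fraction of non-gap residues equal to the most common one."""
    scores = []
    for col in range(len(alignment[0])):
        # "-" is assumed to mark a gap in the alignment
        residues = [seq[col] for seq in alignment if seq[col] != "-"]
        if not residues:
            scores.append(0.0)
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(top_count / len(residues))
    return scores

# Hypothetical toy alignment: three aligned sequences of length 5
toy_alignment = ["MKVLA", "MKILS", "MRVLT"]
scores = column_conservation(toy_alignment)
print(scores)
```

Columns with a score near 1 are conserved; columns with low scores are variable and, by the argument above, more likely to lie on the surface.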


Figure: hydrophobic core (source: http://bio.nagaokaut.ac.jp/~mbp-lab/img/hpc.png)


What’s multiple alignment doing here?

  • Therefore, by looking at multiple alignment, we could predict which residues are in the core of the protein and which are on the surface (“solvent accessibility”)

  • Secondary structure is then predicted by comparing these accessibility patterns with those associated with helices, strands, etc.

  • This approach (Benner & Gerloff) is mostly manual

  • Today’s paper suggests an automated method


The PSIPRED algorithm

  • Given an amino-acid sequence, predict secondary structure elements in the protein

  • Three stages:

    • Generation of a sequence profile (the “multiple alignment” step)

    • Prediction of an initial secondary structure (the neural network step)

    • Filtering of the predicted structure (another neural network step)


Generation of sequence profile

  • A BLAST-like program called “PSI-BLAST” is used for this step

  • We saw BLAST earlier -- it is a fast way to find high scoring local alignments

  • PSI-BLAST is an iterative approach

    • an initial scan of a protein database using the target sequence T

    • align all matching sequences to construct a “sequence profile”

    • scan the database using this new profile

  • Can also pick out and align distantly related protein sequences for our target sequence T


The sequence profile looks like this

  • Has 20 x M numbers, where M is the number of positions in the sequence

  • The numbers are log likelihoods of each residue at each position
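To make the shape of the profile concrete, here is a minimal sketch that builds a table of log-frequencies from a toy alignment. It is not PSI-BLAST's scoring, which is more involved; the pseudocount, the function name `log_profile`, and the toy alignment are all assumptions. The table is stored as M rows of 20 values, which is the same 20 x M numbers laid out row by row.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def log_profile(alignment, pseudocount=1.0):
    """Return M rows of 20 log-frequencies, one row per alignment column."""
    profile = []
    for col in range(len(alignment[0])):
        # Count only standard residues; gaps and other symbols are skipped
        counts = Counter(seq[col] for seq in alignment if seq[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        # The pseudocount (an assumption) keeps unseen residues from giving log(0)
        row = [math.log((counts[aa] + pseudocount) / total) for aa in AMINO_ACIDS]
        profile.append(row)
    return profile

# Hypothetical toy alignment
profile = log_profile(["MKVLA", "MKILS", "MRVLT"])
print(len(profile), len(profile[0]))
```

In the first column every sequence has M, so the entry for M is the largest value in that row.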


Preparing for the second step

  • Feed the sequence profile to an artificial neural network

  • But before feeding, do a simple “scaling” to bring the numbers into the 0-1 range
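The slide does not say which scaling is applied. One common choice that maps any real number into the 0-1 range is the logistic function, so the sketch below uses it as an assumption; `logistic_scale` and `scale_profile` are hypothetical names.

```python
import math

def logistic_scale(value):
    """Map a real number into (0, 1); assumes moderately sized profile values."""
    return 1.0 / (1.0 + math.exp(-value))

def scale_profile(profile):
    """Apply the scaling to every number in a profile (a list of rows)."""
    return [[logistic_scale(v) for v in row] for row in profile]

# Hypothetical profile fragment with one row of three values
scaled = scale_profile([[0.0, -3.0, 5.0]])
print(scaled)
```

A plain min-max rescaling would serve the same purpose; the logistic form has the advantage that it does not depend on the other values in the profile.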


Intro to Neural nets (the second and third steps of PSIPRED)


Artificial Neural Network

  • Supervised learning algorithm

  • Training examples. Each example has a label

    • “class” of the example, e.g., “positive” or “negative”

    • “helix”, “strand”, or “coil”

  • Learns how to predict the class of an example


Artificial Neural Network

  • Directed graph

  • Nodes or “units” or “neurons”

  • Edges between units

  • Each edge has a weight (not known a priori)


Layered Architecture

http://www.akri.org/cognition/images/annet2.gif

Input here is a four-dimensional vector. Each dimension goes into one input unit.


Layered Architecture

Figure: layers of units (source: http://www.geocomputation.org/2000/GC016/GC016_01.GIF)


What a unit (neuron) does

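  • Unit i receives a total input x_i from the units connected to it, and produces an output y_i = f_i(x_i), where f_i() is the “transfer function” of unit i

  • The total input x_i is a weighted sum of the outputs of those units plus a constant term w_i; w_i is called the “bias” of the unit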
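The formula for the total input did not survive in this transcript. Assuming the usual weighted-sum form, and using the weights w_ij and bias w_i named on a later slide, it can be written as below. Here y_j stands for the output of a unit j feeding into unit i, and reading w_ij as the weight on the edge from j to i is an assumption about the index order.

```latex
x_i = \sum_j w_{ij}\, y_j + w_i, \qquad y_i = f_i(x_i)
```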


Weights, bias and transfer function

  • Unit takes n inputs

  • Each input edge has weight w_i

  • Bias b

  • Output a

  • Transfer function f()

    • Linear, sigmoidal, or other
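As a small illustration of the quantities on this slide, the sketch below computes the output a of one unit from n inputs, weights w_i, a bias b, and a transfer function f(). The input values and weights are made up for the example, and `unit_output` is a hypothetical name.

```python
import math

def sigmoid(x):
    """A sigmoidal transfer function."""
    return 1.0 / (1.0 + math.exp(-x))

def unit_output(inputs, weights, bias, transfer=sigmoid):
    """Output a = f(sum_i w_i * input_i + b) for a single unit."""
    total = sum(w * p for w, p in zip(weights, inputs)) + bias
    return transfer(total)

# Made-up numbers: three inputs, three weights, one bias
a_sigmoid = unit_output([1.0, 0.0, 1.0], [0.5, -0.3, 0.2], bias=-0.7)
a_linear = unit_output([1.0, 0.0, 1.0], [0.5, -0.3, 0.2], bias=-0.7,
                       transfer=lambda x: x)
print(a_sigmoid, a_linear)
```

Swapping the `transfer` argument is all it takes to move between a linear and a sigmoidal unit; the weighted sum is the same in both cases.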


Weights, bias and transfer function

  • Weights wij and bias wi of each unit are “parameters” of the ANN.

    • Parameter values are learned from input data

  • Transfer function is usually the same for every unit in the same layer

  • Graphical architecture (connectivity) is decided by you.

    • Could use fully connected architecture: all units in one layer connect to all units in “next” layer
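A fully connected architecture can be sketched as a list of layers, where every unit in a layer has one weight per unit in the previous layer. The layer sizes (4 inputs, 2 hidden units, 3 outputs) and the all-zero placeholder weights are assumptions chosen only to keep the example small; real weights would come from training.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def layer_forward(inputs, weight_rows, biases, transfer=sigmoid):
    """One fully connected layer: each row of weights belongs to one unit."""
    return [transfer(sum(w * v for w, v in zip(row, inputs)) + b)
            for row, b in zip(weight_rows, biases)]

def network_forward(inputs, layers):
    """Feed the outputs of each layer into the next layer."""
    activations = inputs
    for weight_rows, biases in layers:
        activations = layer_forward(activations, weight_rows, biases)
    return activations

# Placeholder parameters (all zeros), not learned values
hidden = ([[0.0] * 4 for _ in range(2)], [0.0] * 2)  # 2 units, 4 weights each
output = ([[0.0] * 2 for _ in range(3)], [0.0] * 3)  # 3 units, 2 weights each
outputs = network_forward([0.2, 0.4, 0.6, 0.8], [hidden, output])
print(outputs)
```

With all parameters at zero every unit outputs 0.5, which shows why the parameters have to be learned before the network says anything useful.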


Where’s the algorithm?

  • It’s in the training of parameters!

  • Given several examples and their labels: the training data

  • Search for parameter values such that output units make correct predictions on the training examples

  • “Back-propagation” algorithm

    • Read up more on neural nets if you are interested
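The following is not full back-propagation. It is a minimal sketch of the underlying idea, gradient descent on the parameters of a single sigmoid unit, using the cross-entropy loss. Back-propagation applies the chain rule to pass the same kind of error signal back through hidden layers. The training data (the logical AND function), the learning rate, and the number of epochs are all assumptions for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict(weights, bias, inputs):
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)

def train_unit(examples, epochs=2000, rate=0.5):
    """Adjust weights and bias so the unit's output matches the labels."""
    weights, bias = [0.0] * len(examples[0][0]), 0.0
    for _ in range(epochs):
        for inputs, label in examples:
            # For a sigmoid unit with cross-entropy loss, the gradient with
            # respect to the total input is simply (output - label)
            error = predict(weights, bias, inputs) - label
            weights = [w - rate * error * x for w, x in zip(weights, inputs)]
            bias -= rate * error
    return weights, bias

# Toy training data: inputs and their labels (logical AND)
AND_EXAMPLES = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
weights, bias = train_unit(AND_EXAMPLES)
print([round(predict(weights, bias, x)) for x, _ in AND_EXAMPLES])
```

The search for parameter values is the whole algorithm here: nothing about AND is coded in, only examples and their labels.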


Back to PSIPRED …


Step 2

  • Feed the sequence profile to the input layer of an ANN

  • Not the whole profile, only a window of 15 consecutive positions

  • For each position, there are 20 numbers in the profile (one for each amino acid)

  • Therefore ~ 15 x 20 = 300 numbers fed

  • Therefore, ~ 300 “input units” in ANN

  • 3 output units, for “strand”, “helix”, “coil”

    • each number is confidence in that secondary structure for the central position in the window of 15
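The windowing described above can be sketched as follows: for a chosen central position, take the 15 profile rows around it and flatten them into one vector of 15 x 20 = 300 numbers. How positions beyond the ends of the sequence are handled is not stated on the slide; padding with zeros is an assumption here, as is the function name `window_inputs`.

```python
def window_inputs(profile, center, width=15):
    """Flatten the profile rows in a window centered on `center`."""
    half = width // 2
    values_per_position = len(profile[0])
    flat = []
    for pos in range(center - half, center + half + 1):
        if 0 <= pos < len(profile):
            flat.extend(profile[pos])
        else:
            # Assumed handling of positions past the sequence ends: zero padding
            flat.extend([0.0] * values_per_position)
    return flat

# Hypothetical scaled profile: 30 positions, 20 numbers each
toy_profile = [[0.1] * 20 for _ in range(30)]
inputs_middle = window_inputs(toy_profile, center=10)
inputs_start = window_inputs(toy_profile, center=0)
print(len(inputs_middle), len(inputs_start))
```

Sliding `center` over every position of the sequence yields one input vector, and therefore one set of three output confidences, per residue.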


Figure: example network for Step 2. A window of 15 positions feeds the input layer, followed by a hidden layer and three output units labeled helix, strand, and coil, with example output values 0.18, 0.09, and 0.67.


Step 3

  • Feed the output of 1st ANN to the 2nd ANN

  • Each window of 15 positions gave 3 numbers from the 1st ANN

  • Take 15 successive windows’ outputs and feed them to 2nd ANN

  • Therefore, ~ 15 x 3 = 45 input units in ANN

  • 3 output units, for “strand”, “helix”, “coil”
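The input to the second network can be assembled the same way as in Step 2, except that each position now contributes the 3 outputs of the first network instead of 20 profile numbers. In this sketch the zero padding at the sequence ends, the order of the output units (strand, helix, coil, following the bullet above), and the rule of picking the largest output as the final label are assumptions.

```python
LABELS = ("strand", "helix", "coil")  # assumed order of the three output units

def second_stage_inputs(first_outputs, center, width=15):
    """Flatten 15 successive 3-number outputs of the first network."""
    half = width // 2
    flat = []
    for pos in range(center - half, center + half + 1):
        if 0 <= pos < len(first_outputs):
            flat.extend(first_outputs[pos])
        else:
            flat.extend([0.0, 0.0, 0.0])  # assumed padding past the ends
    return flat

def final_label(outputs):
    """Pick the label whose output unit has the highest confidence."""
    return LABELS[outputs.index(max(outputs))]

# Hypothetical first-network outputs for a sequence of 40 positions
toy_outputs = [[0.1, 0.7, 0.2] for _ in range(40)]
stage2 = second_stage_inputs(toy_outputs, center=20)
print(len(stage2), final_label([0.1, 0.7, 0.2]))
```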


Test of performance


Cross-validation

  • Partition the training data into “training set” (two thirds of the examples) and “test set” (remaining one third)

  • Train PSIPRED on the training set, then make predictions on the test set and compare them with the known answers.

  • What is an answer?

    • For each position of sequence, a prediction of what secondary structure that position is involved in

    • That is, a sequence over “H/S/C” (helix/strand/coil)

  • How to compare answer with known answer?

    • Number of positions that match
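Counting matching positions can be sketched in a few lines. The two strings below are made-up examples over the H/S/C alphabet, and `match_fraction` is a hypothetical name; it reports the matches as a fraction of the sequence length so that sequences of different lengths can be compared.

```python
def match_fraction(predicted, known):
    """Fraction of positions where the predicted and known labels agree."""
    if len(predicted) != len(known):
        raise ValueError("predicted and known structures must have equal length")
    if not known:
        raise ValueError("sequences must not be empty")
    matches = sum(p == k for p, k in zip(predicted, known))
    return matches / len(known)

# Made-up prediction and known answer over H/S/C
predicted = "HHHCCSSS"
known = "HHCCCSSC"
accuracy = match_fraction(predicted, known)
print(accuracy)
```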

