2d-3D Structure Modelling • S. Shahriar Arab
Flow of information DNA RNA PROTEIN SEQ PROTEIN STRUCT PROTEIN FUNCTION ……….
Prediction in bioinformatics • Important prediction problems: • Protein sequence from genomic DNA • Protein 3D structure from sequence • Protein function from structure • Protein function from sequence
Why predict protein structure? • The sequence-structure gap • Millions of known sequences, but only about 80 000 known structures • Structural knowledge brings understanding of function and mechanism of action • Can help in prediction of function
Why predict protein structure? • Predicted structures can be used in structure based drug design • It can help us understand the effects of mutations on structure or function • It is a very interesting scientific problem • still unsolved in its most general form after more than 20 years of effort
What is protein structure prediction? • In its most general form • a prediction of the (relative) spatial position of each atom in the tertiary structure generated from knowledge only of the primary structure (sequence)
Methods of structure prediction • Ab initio protein folding approaches • Comparative (homology) modelling • Fold recognition/threading
Prediction in one dimension • Secondary structure prediction • Surface accessibility prediction
2D Structure Identification • DSSP - Database of Secondary Structures for Proteins (http://swift.cmbi.kun.nl/gv/dssp/) • VADAR - Volume Area Dihedral Angle Reporter (http://redpoll.pharmacy.ualberta.ca/vadar/) • PDB - Protein Data Bank (www.rcsb.org) QHTAWCLTSEQHTAAVIWDCETPGKQNGAYQEDCA HHHHHHCCEEEEEEEEEEECCHHHHHHHCCCCCCC
Secondary Structure • The DSSP code • H = alpha helix • B = residue in isolated beta-bridge • E = extended strand, participates in beta ladder • G = 3-helix (3/10 helix) • I = 5-helix (pi helix) • T = hydrogen-bonded turn • S = bend • C = coil
Simplifications • Eight states from DSSP • H: α-helix • G: 3₁₀ helix • I: π-helix • E: β-strand • B: bridge • T: β-turn • S: bend • C: coil • CASP Standard • H = (H, G, I) • E = (E, B) • C = (C, T, S) • Identification of secondary structures focuses on • α-helices • β-strands • others (turns, coils, other helices) are collectively called “coils”
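The CASP reduction above is a simple lookup. A minimal sketch, assuming the DSSP assignment is given as a string of one-letter codes; the function name and the choice to treat unknown codes (such as a blank) as coil are assumptions for illustration, not part of the slides.

```python
# CASP-style reduction of the eight DSSP states to three classes:
# H = (H, G, I), E = (E, B), C = (C, T, S).
DSSP_TO_3STATE = {
    "H": "H", "G": "H", "I": "H",
    "E": "E", "B": "E",
    "C": "C", "T": "C", "S": "C",
}


def reduce_to_three_states(dssp_string):
    """Map each DSSP code to H, E or C.

    Assumption: any code not listed above (e.g. a blank or '-') is
    treated as coil.
    """
    return "".join(DSSP_TO_3STATE.get(code, "C") for code in dssp_string)


print(reduce_to_three_states("HGIEBTSC"))  # prints HHHEECCC
```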
What is Secondary structure prediction? • Given a protein sequence (primary structure) GHWIATRGQLIREAYEDYRHFSSECPFIP • Predict its secondary structure content • (C=coils H=Alpha Helix E=Beta Strands) CEEEEECHHHHHHHHHHHCCCHHCCCCCC
Why Secondary Structure Prediction? • Simply an easier problem than 3D structure prediction • Accurate secondary structure prediction provides important information for tertiary structure prediction • Improving alignment accuracy • Protein function prediction • Protein classification
secondary structure prediction • less detailed results • only predicts the H (helix), E (extended) or C (coil/loop) state of each residue, does not predict the full atomic structure • Accuracy of secondary structure prediction • The best methods have an average accuracy of just about 73% (the percentage of residues predicted correctly)
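The accuracy quoted above is the percentage of residues whose three-state label is predicted correctly, usually called Q3. A minimal sketch of that calculation; the function name is hypothetical, and both strings are assumed to be already reduced to H/E/C and to have equal length.

```python
def q3_accuracy(predicted, observed):
    """Percentage of residues whose predicted state matches the observed one."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed strings must have equal length")
    if not observed:
        raise ValueError("cannot score an empty sequence")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


# Toy example: three of four residues agree.
print(q3_accuracy("HHEC", "HHCC"))  # prints 75.0
```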
History of protein secondary structure prediction • First generation • How: single residue statistics • Examples: Chou-Fasman method, Lim method, GOR I, etc. • Accuracy: low • Second generation • How: segment statistics • Examples: ALB method, GOR III, etc. • Accuracy: ~60% • Third generation • How: long-range interaction, homology based • Examples: PHD • Accuracy: ~70%
Chou-Fasman Method • Developed by Chou & Fasman in 1974 & 1978 • Based on frequencies of residues in α-helices, β-sheets and turns • Accuracy ~50 - 60% Q3
Chou-Fasman statistics • R – amino acid, S- secondary structure • f(R,S) – number of occurrences of R in S • Ns – total number of amino acids in conformation S • N – total number of amino acids • P(R,S) – propensity of amino acid R to be in structure S • P(R,S) = (f(R,S)/f(R))/(Ns/N)
Example • #residues=20,000, • #helix=4,000, • #Ala=2,000, • #Ala in helix=500 • f(Ala, α) = 500/20,000, • f(Ala) = 2,000/20,000 • p(α) = Να/Ν=4,000/20,000 • P = (500/2000) / (4,000/20000) = 1.25
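The worked example can be checked in a few lines. This is a minimal sketch of the propensity formula P(R,S) = (f(R,S)/f(R))/(Ns/N) using raw counts; the function and argument names are chosen here for illustration.

```python
def propensity(count_r_in_s, count_r, count_s, count_total):
    """Chou-Fasman style propensity of residue R for structure S.

    (fraction of R found in S) divided by (fraction of all residues in S).
    """
    fraction_of_r_in_s = count_r_in_s / count_r
    fraction_in_s = count_s / count_total
    return fraction_of_r_in_s / fraction_in_s


# Counts from the example: 500 of 2,000 Ala are helical,
# and 4,000 of 20,000 residues overall are helical.
p_ala_helix = propensity(500, 2000, 4000, 20000)
print(f"{p_ala_helix:.2f}")  # prints 1.25
```

A value above 1 means the residue occurs in that structure more often than expected by chance; the scanning rules use the same quantity scaled by 100.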
Scan peptide for α-helix regions • 2. Identify regions where 4 of 6 contiguous residues have P(H) > 100: an “alpha-helix nucleus”
Extend α-helix nucleus • 3. Extend the helix in both directions until a set of four contiguous residues has an average P(H) < 100 • Repeat steps 1 – 3 for the entire peptide
Scan peptide for β-sheet regions • 4. Identify regions where 3 of 5 contiguous residues have P(E) > 100: a “β-sheet nucleus” • 5. Extend the β-sheet until 4 contiguous residues have an average P(E) < 100 • 6. If the region average is > 105 and the average P(E) > average P(H), then call it “β-sheet”
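The nucleation and extension steps can be sketched as two small functions. This is a simplified illustration, not the full Chou-Fasman procedure: it takes a list of per-residue propensities scaled by 100 (as on the slides), the scores in the usage example are made-up values rather than published parameters, and step 6 (comparing helix and sheet averages) is left out.

```python
def mean(values):
    return sum(values) / len(values)


def find_nuclei(scores, window, min_formers, threshold=100):
    """Start indices of windows with at least min_formers scores above threshold."""
    starts = []
    for i in range(len(scores) - window + 1):
        formers = sum(1 for s in scores[i:i + window] if s > threshold)
        if formers >= min_formers:
            starts.append(i)
    return starts


def extend_region(scores, start, end, threshold=100, probe=4):
    """Grow the half-open region [start, end) in both directions.

    Extension stops on each side as soon as the four-residue window that
    would include the next residue averages below the threshold.
    """
    while start > 0 and mean(scores[start - 1:start - 1 + probe]) >= threshold:
        start -= 1
    while end < len(scores) and mean(scores[end - probe + 1:end + 1]) >= threshold:
        end += 1
    return start, end


# Hypothetical P(H) values for a 12-residue peptide.
p_helix = [60, 60, 110, 120, 130, 110, 90, 115, 60, 60, 60, 60]
print(find_nuclei(p_helix, window=6, min_formers=4))  # prints [0, 1, 2, 3]
print(extend_region(p_helix, 2, 8))                   # prints (1, 8)
```

For β-sheet nuclei the same scan applies with `window=5` and `min_formers=3` on the P(E) values.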
The GOR method • developed by Garnier, Osguthorpe & Robson • built on Pij values based on information theory • evaluates each residue PLUS the adjacent 8 N-terminal and 8 carboxyl-terminal residues • sliding window of 17 • GOR III method accuracy ~64% Q3
GOR idea: Statistics that take into account the whole window • Each residue carries two different types of information: • Intra-residue information – information about its own secondary structure • Inter-residue information – the influence of this residue on other residues
GOR….continued • Individual propensity of amino acid R to be in secondary structure S – same idea as in Chou-Fasman • Contribution of 16 neighbours • take the window of radius 8 around the residue in question (8 before and 8 after the residue) • for each residue in the window, consider its contribution to the conformation of the middle residue and add its value to PH, PS, PC • As in Chou-Fasman, the values of all contributions are based on statistics
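The window idea can be sketched as a sum of per-position contributions. This is a simplified illustration in the spirit of GOR, not the published method: the nested table `info[state][offset][residue]` is a hypothetical data structure, its numbers below are invented for the demonstration, and the states are labelled H, E, C rather than PH, PS, PC.

```python
def gor_predict(sequence, info, states="HEC", radius=8):
    """Predict one state per residue from summed window contributions.

    info[state][offset][residue] is the contribution of `residue`, found at
    `offset` from the central position, towards `state` for the central
    residue. Missing entries count as 0. Ties go to the first state listed.
    """
    prediction = []
    for j in range(len(sequence)):
        totals = {}
        for state in states:
            total = 0.0
            for offset in range(-radius, radius + 1):
                k = j + offset
                if 0 <= k < len(sequence):  # window is truncated at the ends
                    table = info.get(state, {}).get(offset, {})
                    total += table.get(sequence[k], 0.0)
            totals[state] = total
        prediction.append(max(states, key=lambda s: totals[s]))
    return "".join(prediction)


# Toy table with made-up values, only to exercise the code.
toy_info = {
    "H": {0: {"A": 1.0}},
    "E": {0: {"V": 1.0}, 1: {"V": 0.5}},
    "C": {0: {"G": 1.0}},
}
print(gor_predict("AVG", toy_info))  # prints HEC
```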
Nearest Neighbour Method • Idea: similar sequences are likely to have the same secondary structure. • Take a window around the amino acid whose conformation is to be predicted • Find several, say k, closest sequences of known structure (with respect to a similarity measure defined differently depending on the variant of the method). • Assign secondary structure based on the conformation of the sequence neighbours. • Use max (nα, nβ, nc) or max (sα, sβ, sc) • Key: scoring measure of evolutionary similarity. • Salamov, Solovyev NNSSP (1995): accuracy above 70%
Neighbours
1 - L H H H H H H L L - S1
2 - L L H H H H H L L - S2
3 - L E E E E E E L L - S3
4 - L E E E E E E L L - S4
n - L L L L E E E E E - Sn
n+1 - H H H L L L E E E - Sn+1
:
• max (nα, nβ, nL) or max (Σsα, Σsβ, ΣsL) or something else…
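The two decision rules, max (nα, nβ, nL) and max (Σsα, Σsβ, ΣsL), can be sketched as follows. The inputs are assumed to be the state of the central residue in each neighbouring window, optionally paired with that neighbour's similarity score; the function names and the example scores are hypothetical.

```python
from collections import Counter, defaultdict


def vote_by_count(neighbour_states):
    """Return the state seen most often among the neighbours."""
    counts = Counter(neighbour_states)
    return max(counts, key=counts.get)


def vote_by_score(neighbours):
    """Return the state with the largest summed similarity score.

    `neighbours` is an iterable of (state, score) pairs.
    """
    totals = defaultdict(float)
    for state, score in neighbours:
        totals[state] += score
    return max(totals, key=totals.get)


print(vote_by_count("HHEEH"))                                 # prints H
print(vote_by_score([("H", 0.2), ("H", 0.3), ("E", 0.9)]))  # prints E
```

The second example shows how the two rules can disagree: helix wins on the number of neighbours, but the single strand neighbour is the closest match.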
Advantages • Information from structural neighbours can be used to provide details to predicted secondary structure (phi,psi angles) • Much higher accuracy than previous methods.
Neural network models • machine learning approach • provide training sets of structures (e.g. α-helices, non-α-helices) • computers are trained to recognize patterns in known secondary structures • provide a test set (proteins with known structures) • accuracy ~70 – 75%
Neural Network Method Recall artificial neurone:
How PHD works • Step 1. BLAST search with input sequence • Step 2. Perform multiple seq. alignment and calculate aa frequencies for each position
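Step 2 turns the alignment into per-position amino acid frequencies. A minimal sketch of that calculation, assuming the alignment is a list of equal-length strings with `-` for gaps; ignoring gaps when normalising is a choice made here for illustration and may differ from what PHD does.

```python
from collections import Counter


def column_frequencies(alignment, gap="-"):
    """Return one {residue: frequency} dict per alignment column.

    Gaps are ignored; an all-gap column yields an empty dict.
    """
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("alignment rows must be non-empty and of equal length")
    profile = []
    for column in zip(*alignment):
        residues = [aa for aa in column if aa != gap]
        counts = Counter(residues)
        total = len(residues)
        profile.append({aa: n / total for aa, n in counts.items()} if total else {})
    return profile


# Toy alignment of three sequences.
for position, freqs in enumerate(column_frequencies(["AC-", "AG-", "TC-"])):
    print(position, freqs)
```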
How PHD works (cont.) • Step3. Level 1: sequence to structure • Take window of 13 adjacent residues • Scores for helix, strand, loop in the output layer, for each residue
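The window of 13 adjacent residues can be produced with a simple slide over the sequence. A minimal sketch; padding the ends with `-` so that every residue gets a full window is an assumption for illustration, and the network that scores each window is not shown.

```python
def sliding_windows(sequence, size=13, pad="-"):
    """Return one window of `size` residues centred on each position."""
    if size % 2 == 0:
        raise ValueError("window size must be odd so it has a central residue")
    half = size // 2
    padded = pad * half + sequence + pad * half
    return [padded[i:i + size] for i in range(len(sequence))]


print(sliding_windows("ABCDE", size=3))
# prints ['-AB', 'ABC', 'BCD', 'CDE', 'DE-']
```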
Prediction tools that use NNs • MACMATCH • (Presnell et al., 1993) • for Macintosh • PHD • (Rost & Sander, 1993) • http://www.predictprotein.org/ • NNPREDICT • (Kneller et al. 1990) • http://www.cmpharm.ucsf.edu/nomi/nnpredict.html
Best of the Best • PredictProtein-PHD (72%) • http://www.predictprotein.org/ • Jpred (73-75%) • http://jura.ebi.ac.uk:8888/ • PREDATOR (75%) • http://www.embl-heidelberg.de/cgi/predator_serv.pl • PSIpred (77%) • http://insulin.brunel.ac.uk/psipred
Accessible Surface Area • [Figure labels: solvent probe, accessible surface, reentrant surface, van der Waals surface]
ASA Calculation • DSSP - Database of Secondary Structures for Proteins (swift.embl-heidelberg.de/dssp) • VADAR - Volume Area Dihedral Angle Reporter (http://redpoll.pharmacy.ualberta.ca/vadar/) • GetArea - www.scsb.utmb.edu/getarea/area_form.html QHTAWCLTSEQHTAAVIWDCETPGKQNGAYQEDCAMD BBPPBEEEEEPBPBPBPBBPEEEPBPEPEEEEEEEEE 1056298799415251510478941496989999999
Other ASA sites • Connolly Molecular Surface Home Page • http://www.biohedron.com/ • Naccess Home Page • http://sjh.bi.umist.ac.uk/naccess.html • ASA Parallelization • http://cmag.cit.nih.gov/Asa.htm • Protein Structure Database • http://www.psc.edu/biomed/pages/research/PSdb/
Accessibility • Accessibility = (Accessible Surface Area (ASA) in folded protein) / (Maximum ASA) • Two state = b (buried), e (exposed) • e.g. b <= 16%, e > 16% • Three state = b (buried), i (intermediate), e (exposed) • e.g. b <= 16%, 16% < i < 36%, e > 36%
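The relative accessibility and the two- and three-state classes can be sketched as below. The 16% and 36% cut-offs are the example thresholds from the slide; the function names are hypothetical, the ASA numbers in the usage lines are invented, and assigning exactly 36% to the exposed class is an assumption, since the slide leaves that boundary open.

```python
def relative_accessibility(asa_folded, asa_max):
    """Relative accessibility in percent: ASA in the folded protein / maximum ASA."""
    if asa_max <= 0:
        raise ValueError("maximum ASA must be positive")
    return 100.0 * asa_folded / asa_max


def two_state(rel_acc, cutoff=16.0):
    """'b' (buried) at or below the cut-off, 'e' (exposed) above it."""
    return "b" if rel_acc <= cutoff else "e"


def three_state(rel_acc, low=16.0, high=36.0):
    """'b' (buried), 'i' (intermediate) or 'e' (exposed)."""
    if rel_acc <= low:
        return "b"
    if rel_acc < high:
        return "i"
    return "e"  # assumption: exactly `high` counts as exposed


rel = relative_accessibility(20.0, 100.0)  # made-up ASA values
print(rel, two_state(rel), three_state(rel))  # prints 20.0 e i
```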
Accessibility Prediction • Example: QHTAW... QHTAWCLTSEQHTAAVIW BBPPBEEEEEPBPBPBPB • PredictProtein-PHDacc (58%) • http://cubic.bioc.columbia.edu/predictprotein • PredAcc (70%?) • http://condor.urbb.jussieu.fr/PredAccCfg.html
3D structure prediction of proteins • [Figure: ab initio prediction, threading and building by homology placed along a sequence similarity axis (0 to 100%), with regions labelled “new folds” and “existing folds”]
Choice of prediction methods • If you can find similar sequences of known structure then comparative modelling is the best way to predict structure • all other methods are less reliable • Of course, you can’t always find similar sequences of known structure.
What if you can’t do comparative modelling? • Secondary structure prediction • Fold recognition/threading • Ab initio protein folding approaches
Divergent evolution • Different proteins in different organisms have diverged from a common ancestor protein • Each copy of this ancestor in various organisms has been subject to mutations, deletions, and insertions of amino acids in its sequence • In general, its 3-D fold and function have remained similar