Secondary structure prediction using decision lists

Secondary Structure Prediction Using Decision Lists

Deniz YURET

Volkan KURT



Outline

  • What is the problem?

  • What are the different approaches?

  • How do we use decision lists and why?

  • Why does evolution help?



What is the problem?

  • The generic prediction algorithm

  • Some important pitfalls: definition, data set

  • Upper and lower bounds on performance

  • Evolution and homology enters the picture



Tertiary / Quaternary Structure





Secondary Structure

MRRWFHPNITGVEAENLLLTRGVDGSFLARPSKSNPGD

----------HHHHHHHHHH------EEEEE-------



A Generic Prediction Algorithm

  • Sequence to Structure

  • Structure to Structure


A Generic Prediction Algorithm: Sequence to Structure

MRRWFHPNITGVEAENLLLTRGVDGSFLARPSKSNPGD

??????????????????????????????????????
-?????????????????????????????????????
--????????????????????????????????????
---???????????????????????????????????
----H-----????????????????????????????
----H-----H???????????????????????????
----H-----HH??????????????????????????
----H-----HHHHHHHHHH------EEEEE------?
----H-----HHHHHHHHHH------EEEEE-------

(The original slides animate this one residue at a time: each position is predicted in turn from its sequence context.)

A Generic Prediction Algorithm: Structure to Structure

MRRWFHPNITGVEAENLLLTRGVDGSFLARPSKSNPGD

?---H-----HHHHHHHHHH------EEEEE-------
-?--H-----HHHHHHHHHH------EEEEE-------
--?-H-----HHHHHHHHHH------EEEEE-------
----?-----HHHHHHHHHH------EEEEE-------
----------HHHHHHHHHH------EEEEE------?
----------HHHHHHHHHH------EEEEE-------

(The second pass re-examines each position using the predicted structure of its neighbors; here the isolated helix assignment at position 5 is corrected to loop.)



Pitfalls for newcomers

  • Definition of secondary structure

  • Choice of data set



Pitfall 1: Definition of Secondary Structure

  • DSSP: H, B, E, G, I, T, S

  • STRIDE: H, G, I, E, B, b, T, C

  • DEFINE: ???

  • Convert all to H, E, and - (loop)

  • They only agree 71% of the time!!!

    (95% for DSSP and STRIDE)

  • Solution: Use DSSP



Pitfall 2: Dataset

  • Trivial to get 80%+ when homologies are present between the training and the test set

  • Homology identification keeps evolving

  • RS126, CB513, etc.

  • Comparing programs on different data sets is meaningless…



Performance Bounds

  • Simple baselines for lower bound

  • A method for estimating an upper bound


Performance Bounds

Baseline 1: 43% of all residues are tagged “loop”.

43%: assign loop


Performance Bounds

Baseline 2: 49% of all residues are tagged with the most frequent structure for the given amino acid.

49%: assign most frequent
43%: assign loop
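Both baselines can be computed directly from any labeled data set; a minimal Python sketch on toy data (not the real data set, and `baseline_accuracies` is our illustrative name):

```python
from collections import Counter

def baseline_accuracies(residues, labels):
    """Two baselines: (1) always predict loop '-';
    (2) predict each amino acid's most frequent structure."""
    n = len(labels)
    loop_acc = labels.count("-") / n
    # tally the structure labels seen for each amino acid
    per_aa = {}
    for aa, s in zip(residues, labels):
        per_aa.setdefault(aa, Counter())[s] += 1
    best = {aa: c.most_common(1)[0][0] for aa, c in per_aa.items()}
    mf_acc = sum(best[aa] == s for aa, s in zip(residues, labels)) / n
    return loop_acc, mf_acc

# Toy labeled fragment (H = helix, E = strand, '-' = loop)
seq, struct = "MRRWFHPNIT", "----HH--E-"
loop_acc, mf_acc = baseline_accuracies(seq, struct)
```

On a realistic data set these come out near the 43% and 49% figures above.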


Performance Bounds

Upper bound: only consider exact matches for a given frame size. As the frame size increases, accuracy should increase but coverage should fall.

100%: ???
49%: assign most frequent
43%: assign loop



Upper Bound with Homologs



Upper Bound without Homologs


Performance Bounds

Upper bound: only consider exact matches for a given frame size. As the frame size increases, accuracy should increase but coverage should fall.

100%: ???
75%: estimated upper bound
49%: assign most frequent
43%: assign loop



The Miracle of Homology

  • People used to be stuck at around 60%.

  • Rost and Sander crossed the 70% barrier in 1993 using homology information.

  • All algorithms benefit 5-10% from homology.

  • The homologues are of unknown structure; training and test sets are still unrelated!

  • Why?


The Miracle of Homology

(charts: accuracy around 60% without homology information, around 70% with it)


Outline

  • What is the problem?

  • What are the different approaches?

  • How do we use decision lists and why?

  • Why does evolution help?


GOR V

Sequence (+ PSI-BLAST profiles: +6.5%)
-> Information Function / Bayesian Statistics: 66.9%
-> Majority Vote Filter
-> Secondary Structure: 73.4%

* Garnier et al., 2002

PHD

Frequency Profile (HSSP)
-> Neural Network (sequence to structure): 61.7% / 65.9%
-> Neural Network (structure to structure): 62.6% / 67.4% (+4.3%)
-> Jury + Filter (+3.4%)
-> Secondary Structure: 70.8%

* Rost & Sander, 1993


JNet

Profile (PSI-BLAST / HMMER2 / CLUSTALW)
-> Neural Network
-> Neural Network
-> Jury Network
-> Secondary Structure: 76.9%

* Cuff & Barton, 2000


PSIPRED

Profiles (PSI-BLAST)
-> Neural Network
-> Neural Network
-> Secondary Structure: 76.3%

* Jones, 1999



Outline

  • What is the problem?

  • What are the different approaches?

  • How do we use decision lists and why?

  • Why does evolution help?



Introduction to Decision Lists

  • Prototypical machine learning problem:

    • Decide democrat or republican for 435 representatives based on 16 votes.

Class Name: 2 (democrat, republican)

1. handicapped-infants: 2 (y,n)

2. water-project-cost-sharing: 2 (y,n)

3. adoption-of-the-budget-resolution: 2 (y,n)

4. physician-fee-freeze: 2 (y,n)

5. el-salvador-aid: 2 (y,n)

6. religious-groups-in-schools: 2 (y,n)

16. export-administration-act-south-africa: 2 (y,n)



Introduction to Decision Lists

  • Prototypical machine learning problem:

    • Decide democrat or republican for 435 representatives based on 16 votes.

1. If adoption-of-the-budget-resolution = y

and anti-satellite-test-ban = n

and water-project-cost-sharing = y

then democrat

2. If physician-fee-freeze = y

then republican

3. If TRUE then democrat
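A decision list is applied by scanning the rules top to bottom and returning the label of the first rule whose conditions all hold; a minimal sketch of the voting example above (the helper name `apply_decision_list` is ours, not from the talk):

```python
def apply_decision_list(rules, example):
    """Each rule is (conditions, label); the final rule has empty
    conditions ("if TRUE"), so the list always produces an answer."""
    for conditions, label in rules:
        if all(example.get(attr) == val for attr, val in conditions.items()):
            return label
    raise ValueError("decision list has no default rule")

rules = [
    ({"adoption-of-the-budget-resolution": "y",
      "anti-satellite-test-ban": "n",
      "water-project-cost-sharing": "y"}, "democrat"),
    ({"physician-fee-freeze": "y"}, "republican"),
    ({}, "democrat"),  # rule 3: if TRUE then democrat
]

vote = {"physician-fee-freeze": "y", "el-salvador-aid": "y"}
answer = apply_decision_list(rules, vote)  # rule 1 fails, rule 2 fires
```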



The Greedy Prepend Algorithm


Rule Search

  • Initially everything is predicted to be the most frequent structure (i.e., loop)

(diagram: the training set is partitioned by the base rule into correct (+) and false (−) assignments)


Rule Search

  • At each step, add the maximum-gain rule

(diagram: the base-rule partition is split again by the second rule into + and − regions)


GPA Rules

  • The first three rules of the sequence-to-structure decision list

    • 58.86% performance (of the full list's 66.36%)



GPA Rule 1

  • Everything => Loop


Gpa rule 2

GPA Rule 2



GPA Rule 3


GPA

Sequence (+ PSI-BLAST homologs: +6.67%)
-> GPA (sequence to structure): 60.48%
-> GPA (structure to structure)
-> Secondary Structure: 62.54% / 69.21% (without / with homologs)



Experimental Setup

  • DSSP assignments

  • Reduction:

    • E (extended strand), B (beta bridge) -> Strand

    • H (alpha helix), G (3-10 helix) -> Helix

    • Others -> Loop

  • Data set:

    • CB513 set

    • 7-fold cross-validation
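The 8-to-3 state reduction above is a simple table lookup; a minimal sketch:

```python
def reduce_dssp(dssp):
    """Map 8-state DSSP codes to the 3-state alphabet used here:
    E, B -> strand (E); H, G -> helix (H); everything else -> loop (-)."""
    table = {"E": "E", "B": "E", "H": "H", "G": "H"}
    return "".join(table.get(s, "-") for s in dssp)
```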



GPA Performance

  • Performance of seq-to-struct decision list:

    • Without homologs: 60.48% (29 to 66 rules)

    • With homologs: 66.36% (46 to 68 rules)

  • Performance with struct-to-struct filter:

    • Without homologs: 62.54% (18 to 116 rules)

    • With homologs: 69.21% (16 to 40 rules)



GPA Performance

  • Performance at 20 rules at both steps:

    • Without homologs: 62.15%

    • With homologs: 69.08%

  • Possible to make a back-of-the-envelope structure prediction using our model



Comparison on CB513

  • PHD: 72.3

  • NNSSP: 71.7

  • GPA: 69.2

  • DSC: 69.1

  • Predator: 69.0



Outline

  • What is the problem?

  • What are the different approaches?

  • How do we use decision lists and why?

  • Why does evolution help?


The Miracle of Homology

(chart: accuracy around 70% with homology information)



Discussion

  • Training set homologues and test set homologues help for different reasons.

  • Training set homologues use semi-accurate guesses of structure to provide information on amino-acid substitutions

  • Test set homologues take advantage of “independent errors” in prediction

  • The less similar the homologue sequences the better…



Summary

  • Homologues between the training set and the test set unfairly influence results.

  • Homologues within the training set and the test set still help significantly.

  • There is an upper bound at around 75% unless we use a homologue of the target protein.

  • Very different learning algorithms converge on comparable accuracy.



Some Educated Guesses

  • Significant progress probably requires better homology detection rather than better learning algorithms.

  • To exceed the 75% bound one needs to start incorporating long range interactions.

  • CASP shows predicting tertiary structure first gives comparable results – is there still any use for secondary structure prediction?



Thank you…

  • The algorithm, the paper, etc. available from:

    [email protected]



Introduction

  • Protein Structure

    • What is Secondary Structure?

    • What is Tertiary Structure?

  • Secondary structure Prediction

    • What are decision lists?

    • GPA in Action

  • Tertiary Structure Prediction



Protein Structure

  • Primary Structure

    • Sequences

  • Secondary Structure

    • Frequent Motifs

  • Tertiary Structure

    • Functional Form

  • Quaternary Structure

    • Protein complexes



Primary Structure

  • Sequence information

  • Contains only amino acid sequences

    • 24 amino acid codes present

    • 20 standard residues

    • Glutamine or Glutamic Acid -> GLX (GLU)

    • Asparagine or Aspartic Acid -> ASX (ASN)

    • Others (Non-natural/Unknown) -> X

      • Selenocysteine, Pyrrolysine



Secondary Structure

  • Rigid structure motifs

  • Do not give information about coordinates of residues

  • Can be seen as a one-dimensional reduction of the tertiary structure

  • If accurately predicted, can be used to

    • Predict the final (tertiary) structure

    • Predict the fold type (all-alpha/all-beta etc.)



Common Secondary Structure Motifs

Parallel beta-sheet

alpha-helix

Antiparallel beta-sheet



Tertiary/Quaternary Structure

  • Tertiary Structure

    • The functional form

    • Coordinates of residues in the space

  • Quaternary Structure

    • Protein – Protein complexes

    • Assembly of one or more proteins



Structure Prediction

  • Easier to determine sequence than structure

  • Predictions may help close the gap



Secondary Structure Prediction

  • Assessment of Prediction Accuracy

  • Common Strategy

  • Methods in Literature

  • Decision Lists

    • Prediction using GPA

  • A Performance Bound



Secondary Structure Prediction

  • Predictions based on

    • Sequence Information

    • Multiple Sequence Alignments

  • Various algorithms exist, based on

    • Information Theory

    • Machine Learning

    • Neural Networks etc.



Assessment of Accuracy

  • Determination method

    • DSSP

  • Performance Metric

    • Q3 accuracy

    • Three state accuracy (helix/strand/loop)

  • Data set selection

    • Non-redundancy

  • Homology Information

    • Multiple Sequence Alignments

  • Cross-Validation
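Q3 is simply per-residue agreement over the three states; a minimal sketch:

```python
def q3(predicted, actual):
    """Three-state accuracy: fraction of positions where the
    predicted H/E/- label matches the DSSP-derived label."""
    assert len(predicted) == len(actual)
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)
```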



Two Levels of Prediction


  • First Level:

    • Sequence to Structure

  • Input:

    • Sequence Information

    • Multiple Sequence Alignments

  • Method:

    • Machine Learning

    • Neural Networks

  • Output

    • Secondary Structure

(diagram) Sequence + MSA -> Sequence to Structure -> Secondary Structure



Two Levels of Prediction


  • Second Level:

    • Structure to Structure

  • Input:

    • Structure Information

  • Method:

    • Machine Learning

    • Neural Networks

    • Filter

      • Simple Filters

      • Jury Decisions

  • Output

    • Secondary Structure

(diagram) Secondary Structure -> Structure to Structure -> Filter -> Secondary Structure



Decision Lists

  • Machine Learning method

  • Simply, a list of rules

  • Each rule asserts a guess

  • Generalization by simple rule pruning

  • Output is human readable/understandable



GPA

  • Greedy Decision List

  • Start with a global (base) rule

  • At every step

    • Find the maximum gain rule

    • Prepend to the previous list (new rules take priority)

  • Stop when gain change is 0
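The loop above can be sketched as follows. For brevity, candidate rules are restricted to single-attribute conditions; the function names are ours and this is not the paper's implementation:

```python
from collections import Counter

def predict(dlist, x):
    """Label of the first rule whose conditions all hold."""
    for cond, label in dlist:
        if all(x.get(a) == v for a, v in cond.items()):
            return label

def accuracy(dlist, data):
    return sum(predict(dlist, x) == y for x, y in data) / len(data)

def gpa(data, attributes, labels):
    """Greedy Prepend Algorithm (sketch): start from a default rule
    predicting the majority class, then repeatedly prepend the
    single-condition rule with the largest training-accuracy gain."""
    majority = Counter(y for _, y in data).most_common(1)[0][0]
    dlist = [({}, majority)]              # base rule: if TRUE then majority
    while True:
        base = accuracy(dlist, data)
        best_gain, best_rule = 0.0, None
        for a in attributes:
            for v in {x.get(a) for x, _ in data}:
                for lab in labels:
                    g = accuracy([({a: v}, lab)] + dlist, data) - base
                    if g > best_gain:
                        best_gain, best_rule = g, ({a: v}, lab)
        if best_rule is None:             # gain change is 0: stop
            return dlist
        dlist.insert(0, best_rule)

# Toy training set: alanine tends to be helix, the rest loop
data = [({"aa": "A"}, "H"), ({"aa": "A"}, "H"),
        ({"aa": "L"}, "-"), ({"aa": "G"}, "-"), ({"aa": "L"}, "-")]
dlist = gpa(data, ["aa"], ["H", "-"])
```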



Data Representation

  • Frames of length W

    • The context of an amino acid is represented by W residues

    • (W-1)/2 to the left, (W-1)/2 to the right

    • Positions beyond the chain termini are represented as NAN

    • GLX = GLN, ASX = ASN

    • Newly found / non-natural amino acids = X
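Frame construction with NAN padding at the chain termini can be sketched as:

```python
def frames(sequence, w):
    """Length-w context frame around every residue: (w-1)/2 neighbors
    on each side, with out-of-chain positions marked 'NAN'."""
    assert w % 2 == 1, "frame size must be odd"
    half = (w - 1) // 2
    padded = ["NAN"] * half + list(sequence) + ["NAN"] * half
    return [padded[i:i + w] for i in range(len(sequence))]
```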


Sample Data

  • evealekkv[aaLes]vqalekkvealehg (the bracketed frame lies in a helix)

  • Frame Size = 5

  • Represents the features used in the prediction of secondary structure for L (leucine)



2-level Algorithm

  • Sequence to Structure List

    • Find the first rule that matches the data point

    • Assign the output of that rule

    • A frame of 9 residues is input

    • Output: Secondary Structure

  • Structure to Structure List

    • After all predictions are made, check for possible improvements

    • A frame of 19 secondary structures is input

    • Output: Secondary Structure
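The two levels compose as sketched below; `seq_list` and `struct_list` stand in for trained decision lists and are stubbed here with trivial one-condition rules:

```python
def apply_list(dlist, frame):
    """First matching rule wins; conditions map frame offsets to symbols."""
    for cond, label in dlist:
        if all(frame[i] == v for i, v in cond.items()):
            return label

def predict_two_level(sequence, seq_list, struct_list, w1=9, w2=19):
    """Level 1: predict each residue from a frame of w1 residues.
    Level 2: re-predict from a frame of w2 level-1 structure labels."""
    def frames(items, w):
        half = (w - 1) // 2
        padded = ["NAN"] * half + list(items) + ["NAN"] * half
        return [padded[i:i + w] for i in range(len(items))]
    level1 = [apply_list(seq_list, f) for f in frames(sequence, w1)]
    return "".join(apply_list(struct_list, f) for f in frames(level1, w2))

# Stub lists: level 1 says helix for leucine, else loop;
# level 2 is an identity filter (offsets 4 and 9 are the frame centers).
seq_list = [({4: "L"}, "H"), ({}, "-")]
struct_list = [({9: "H"}, "H"), ({}, "-")]
out = predict_two_level("MRLLM", seq_list, struct_list)
```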



GPA/PHD/GORV



Discussion - Why GPA?

  • Amazingly simple models

    • With as low as 20 rules in the first level and as low as 20 rules in the second

  • Rules (Models) are human-readable

    • Biological rules may be inferred

  • Second level decision list may be used as a filter for other algorithms



A Performance Bound Claim

  • Using only sequence information, the highest achievable performance has an upper bound

  • The lower bound:

    • 43%, with everything assigned as loop

    • 49%, with every residue assigned the most probable structure

  • The upper bound:

    • 75%, with non-homologous data



A Performance Bound Claim

  • Bound is calculated by:

    • Taking only the exact frame matches between the training and test sets

    • Assigning the most frequent structure of that frame in the training set as the guess

    • Comparing with the actual value

  • A bound for non-homologous training and testing sets

  • A bound for carefully selected frame size

    • Not too short (assignments would be almost random)

    • Not too long (only unique frames will be available)
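The estimation procedure can be sketched as follows (toy sequences; `exact_match_bound` is our illustrative name):

```python
from collections import Counter, defaultdict

def exact_match_bound(train, test_pairs, w):
    """For each test frame that occurs verbatim in the training set,
    predict the training set's most common label for that frame.
    Returns (accuracy on covered frames, coverage)."""
    half = (w - 1) // 2
    table = defaultdict(Counter)
    for seq, labels in train:
        for i in range(len(seq) - w + 1):
            table[seq[i:i + w]][labels[i + half]] += 1
    hits = covered = total = 0
    for seq, labels in test_pairs:
        for i in range(len(seq) - w + 1):
            total += 1
            frame = seq[i:i + w]
            if frame in table:
                covered += 1
                hits += table[frame].most_common(1)[0][0] == labels[i + half]
    return (hits / covered if covered else 0.0,
            covered / total if total else 0.0)

acc, cov = exact_match_bound([("MRLLM", "--HH-")], [("ARLLQ", "--HH-")], w=3)
```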



Upper Bound with Homologs



Upper Bound without Homologs



Tertiary Structure Prediction

  • Predictions based on backbone dihedral angles

    • Phi and Psi angles fully define the tertiary structure

  • Goal:

    • Discover the right level of granularity



Data Set Selection

  • PDB-Select

    • A set of non-homologous proteins of high resolution [Hobohm & Sander, 1994]

  • Data representation

    • Frames of 9 residues

    • Residue names plus residue properties

      • Hydrophobicity, polarity, volume, charge etc.

  • Train/Validation/Test



Data Discretization

  • Phi/Psi angles are continuous

    • We need a discrete representation to predict them in a decision list

  • Split the (-180, 180) region into bins

  • Split the Ramachandran plot into bins
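Equal-width binning of a dihedral angle over (-180, 180] can be sketched as:

```python
def angle_bin(angle, n_bins):
    """Discretize an angle in (-180, 180] into one of n_bins
    equal-width bins; 180 is clamped into the last bin."""
    width = 360.0 / n_bins
    return min(int((angle + 180.0) // width), n_bins - 1)
```

Binning the two-dimensional Ramachandran plot then amounts to pairing the phi bin with the psi bin.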



Ramachandran Plot (1)



Ramachandran Plot (2)

* Karplus, 1996



How to Predict?

  • Predictions using sequence information

    • No homology information

  • Predicted angles may be incorporated

    • Upper bounds will be given

  • Accuracy

    • Percent of correct estimates

    • RMSD of phi and psi angles



Using Predicted Angles



Performance: Accuracy



Performance: RMSD



Performance: Backbone RMSD



Performance: Input Features



Performance: Real Prediction



Future Work

  • For tertiary structure predictions:

    • The two-leveled approach may be applied to tertiary structure predictions

    • Homology information may be incorporated

  • For secondary structure predictions:

    • Find better homologues and better representations

    • Incorporating sequence and homology information in the structure-to-structure step may be an option

  • For both predictions:

    • A reliability index for the predicted structure

