Secondary Structure Prediction Using Decision Lists

Deniz YURET, Volkan KURT

Outline • What is the problem? • What are the different approaches? • How do we use decision lists and why? • Why does evolution help?

What is the problem? • The generic prediction algorithm • Some important pitfalls: definition, data set • Upper and lower bounds on performance • Evolution and homology enter the picture

Tertiary / Quaternary Structure

Secondary Structure MRRWFHPNITGVEAENLLLTRGVDGSFLARPSKSNPGD ----------HHHHHHHHHH------EEEEE-------

A Generic Prediction Algorithm • Sequence to Structure • Structure to Structure

A Generic Prediction Algorithm: Sequence to Structure MRRWFHPNITGVEAENLLLTRGVDGSFLARPSKSNPGD ??????????????????????????????????????

A Generic Prediction Algorithm: Sequence to Structure MRRWFHPNITGVEAENLLLTRGVDGSFLARPSKSNPGD -?????????????????????????????????????

A Generic Prediction Algorithm: Sequence to Structure MRRWFHPNITGVEAENLLLTRGVDGSFLARPSKSNPGD --????????????????????????????????????

A Generic Prediction Algorithm: Sequence to Structure MRRWFHPNITGVEAENLLLTRGVDGSFLARPSKSNPGD ---???????????????????????????????????

A Generic Prediction Algorithm: Sequence to Structure MRRWFHPNITGVEAENLLLTRGVDGSFLARPSKSNPGD ----H-----????????????????????????????

A Generic Prediction Algorithm: Sequence to Structure MRRWFHPNITGVEAENLLLTRGVDGSFLARPSKSNPGD ----H-----H???????????????????????????

A Generic Prediction Algorithm: Sequence to Structure MRRWFHPNITGVEAENLLLTRGVDGSFLARPSKSNPGD ----H-----HH??????????????????????????

A Generic Prediction Algorithm: Sequence to Structure MRRWFHPNITGVEAENLLLTRGVDGSFLARPSKSNPGD ----H-----HHHHHHHHHH------EEEEE------?

A Generic Prediction Algorithm: Sequence to Structure MRRWFHPNITGVEAENLLLTRGVDGSFLARPSKSNPGD ----H-----HHHHHHHHHH------EEEEE-------

A Generic Prediction Algorithm: Structure to Structure MRRWFHPNITGVEAENLLLTRGVDGSFLARPSKSNPGD ?---H-----HHHHHHHHHH------EEEEE-------

A Generic Prediction Algorithm: Structure to Structure MRRWFHPNITGVEAENLLLTRGVDGSFLARPSKSNPGD ----H-----HHHHHHHHHH------EEEEE-------

A Generic Prediction Algorithm: Structure to Structure MRRWFHPNITGVEAENLLLTRGVDGSFLARPSKSNPGD -?--H-----HHHHHHHHHH------EEEEE-------

A Generic Prediction Algorithm: Structure to Structure MRRWFHPNITGVEAENLLLTRGVDGSFLARPSKSNPGD --?-H-----HHHHHHHHHH------EEEEE-------

A Generic Prediction Algorithm: Structure to Structure MRRWFHPNITGVEAENLLLTRGVDGSFLARPSKSNPGD ----?-----HHHHHHHHHH------EEEEE-------

A Generic Prediction Algorithm: Structure to Structure MRRWFHPNITGVEAENLLLTRGVDGSFLARPSKSNPGD ----------HHHHHHHHHH------EEEEE-------

A Generic Prediction Algorithm: Structure to Structure MRRWFHPNITGVEAENLLLTRGVDGSFLARPSKSNPGD ----------HHHHHHHHHH------EEEEE------?

A Generic Prediction Algorithm: Structure to Structure MRRWFHPNITGVEAENLLLTRGVDGSFLARPSKSNPGD ----------HHHHHHHHHH------EEEEE-------
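The two passes animated above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `window`, `predict_pass`, and `smooth_isolated` are hypothetical names, the smoothing rule is a toy stand-in for a learned structure-to-structure classifier, and the second pass here reads the first-pass output rather than updating the string in place as the slides do.

```python
from typing import Callable


def window(s: str, i: int, radius: int, pad: str = "_") -> str:
    """Return the frame of 2*radius+1 symbols centred on position i."""
    padded = pad * radius + s + pad * radius
    return padded[i : i + 2 * radius + 1]


def predict_pass(source: str, classify: Callable[[str], str], radius: int) -> str:
    """Label every position from the frame around it.

    Sequence to structure: source is the amino-acid string.
    Structure to structure: source is the previous pass's output.
    """
    return "".join(classify(window(source, i, radius)) for i in range(len(source)))


def smooth_isolated(frame: str) -> str:
    """Toy structure-to-structure rule (assumption): a residue whose two
    neighbours agree with each other but not with it takes their label."""
    centre = len(frame) // 2
    left, mid, right = frame[centre - 1], frame[centre], frame[centre + 1]
    return left if left == right and mid != left else mid


# First-pass output as shown on the slides, with a stray H at position 5.
first_pass = "----H-----HHHHHHHHHH------EEEEE-------"
second_pass = predict_pass(first_pass, smooth_isolated, radius=1)
print(second_pass)
```

Any sequence-to-structure classifier with the signature `frame -> label` could be plugged into `predict_pass` for the first pass; none is shown here because the slides do not specify one at this point.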

Pitfalls for newcomers • Definition of secondary structure • Choice of data set

Pitfall 1: Definition of Secondary Structure • DSSP: H, B, E, G, I, T, S • STRIDE: H, G, I, E, B, b, T, C • DEFINE: ??? • Convert all to H, -, and E • After conversion the three definitions agree only 71% of the time (95% for DSSP and STRIDE) • Solution: Use DSSP
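The conversion and agreement measurement can be sketched as below. The slide does not say which reduction was used; the mapping here (H, G, I to H; E, B to E; everything else to loop) is one common convention and is an assumption, as are the names `DSSP_TO_3`, `reduce_to_three`, and `agreement`.

```python
# Assumed 8-state to 3-state reduction; other conventions exist.
DSSP_TO_3 = {"H": "H", "G": "H", "I": "H", "E": "E", "B": "E"}


def reduce_to_three(labels: str) -> str:
    """Map a per-residue label string to H / E / - (loop)."""
    return "".join(DSSP_TO_3.get(c, "-") for c in labels)


def agreement(a: str, b: str) -> float:
    """Fraction of positions at which two label strings agree."""
    if len(a) != len(b):
        raise ValueError("label strings must have equal length")
    if not a:
        raise ValueError("label strings must be non-empty")
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Toy usage with made-up label strings for the same protein.
dssp_like = reduce_to_three("HHHGTTEEBS")
other = "HHH---EEE-"
print(dssp_like, agreement(dssp_like, other))
```

Changing the reduction changes the reported accuracy of every program, which is why fixing one definition before comparing methods matters.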

Pitfall 2: Dataset • Trivial to get 80%+ when homologies are present between the training and the test set • Homology identification keeps evolving • RS126, CB513, etc. • Comparison of programs on different data sets is meaningless…

Performance Bounds • Simple baselines for lower bound • A method for estimating an upper bound

Performance Bounds • Baseline 1: 43% of all residues are tagged “loop” • Bounds so far: 43% (assign loop)

Performance Bounds • Baseline 2: 49% of all residues are tagged with the most frequent structure for their amino acid, so always assigning that structure scores 49% • Bounds so far: 49% (assign most frequent), 43% (assign loop)
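Both baselines are simple enough to sketch directly. The code below assumes training data given as (sequence, structure) pairs of equal-length strings; the function names and the toy data are illustrative only and do not reproduce the 43% and 49% figures.

```python
from collections import Counter, defaultdict

Pairs = list[tuple[str, str]]  # (amino-acid sequence, structure labels)


def majority_label(train: Pairs) -> str:
    """Baseline 1: the single most frequent label in the training set."""
    counts = Counter(label for _, struct in train for label in struct)
    return counts.most_common(1)[0][0]


def per_residue_table(train: Pairs) -> dict[str, str]:
    """Baseline 2: most frequent label for each amino acid."""
    table: dict[str, Counter] = defaultdict(Counter)
    for seq, struct in train:
        for aa, label in zip(seq, struct):
            table[aa][label] += 1
    return {aa: c.most_common(1)[0][0] for aa, c in table.items()}


def predict(seq: str, table: dict[str, str], fallback: str) -> str:
    """Label each residue from the table, falling back for unseen residues."""
    return "".join(table.get(aa, fallback) for aa in seq)


def accuracy(pred: str, gold: str) -> float:
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)


# Toy data, not real proteins.
train = [("AAVG", "HHE-"), ("GVGG", "-E--")]
fallback = majority_label(train)
table = per_residue_table(train)
print(predict("AVGX", table, fallback), accuracy(predict("AVG", table, fallback), "HE-"))
```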

Performance Bounds • Upper bound: Only consider exact matches for a given frame size. As the frame size increases, accuracy should increase but coverage should fall. • Bounds so far: 100% ???, 49% (assign most frequent), 43% (assign loop)

Upper Bound with Homologs

Upper Bound without Homologs

Performance Bounds • Upper bound: Only consider exact matches for a given frame size. As the frame size increases, accuracy should increase but coverage should fall. • Bounds: 100% ???, 75% (estimated upper bound), 49% (assign most frequent), 43% (assign loop)
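The exact-match idea can be sketched as a lookup table keyed by frames. This is an illustration under assumptions, not the procedure behind the 75% estimate: frames are taken only where a full window fits, a matched frame is labelled with its most frequent training label, and `frames` and `frame_match_bound` are hypothetical names.

```python
from collections import Counter, defaultdict
from typing import Iterator

Pairs = list[tuple[str, str]]  # (amino-acid sequence, structure labels)


def frames(seq: str, struct: str, radius: int) -> Iterator[tuple[str, str]]:
    """Yield (frame, centre label) for each position with a full frame."""
    for i in range(radius, len(seq) - radius):
        yield seq[i - radius : i + radius + 1], struct[i]


def frame_match_bound(train: Pairs, test: Pairs, radius: int) -> tuple[float, float]:
    """Return (accuracy on matched frames, coverage) for one frame size."""
    table: dict[str, Counter] = defaultdict(Counter)
    for seq, struct in train:
        for frame, label in frames(seq, struct, radius):
            table[frame][label] += 1

    total = matched = correct = 0
    for seq, struct in test:
        for frame, label in frames(seq, struct, radius):
            total += 1
            if frame in table:  # exact match seen in training
                matched += 1
                correct += table[frame].most_common(1)[0][0] == label
    coverage = matched / total if total else 0.0
    accuracy = correct / matched if matched else 0.0
    return accuracy, coverage


# Toy data: larger frames match less often.
train = [("ABCDE", "HHE--")]
test = [("ABCDX", "HHE--")]
for radius in (0, 1):
    print(radius, frame_match_bound(train, test, radius))
```

Running this over increasing `radius` on a real data set would trace out the accuracy and coverage trade-off the slide describes.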

The Miracle of Homology • People used to be stuck at around 60%. • Rost and Sander crossed the 70% barrier in 1993 using homology information. • All algorithms benefit 5-10% from homology. • The homologues are of unknown structure; training and test sets are still unrelated! • Why?

The Miracle of Homology 60%

The Miracle of Homology 70%

GORV (Garnier et al, 2002) [diagram] • Stages: Sequence → Information Function / Bayesian Statistics → Secondary Structure: 66.9% • With PSI-BLAST, Majority Vote, and Filter: +6.5%, giving 73.4%

PHD (Rost & Sander, 1993) [diagram] • Input: HSSP frequency profile • Stages: Neural Network → Neural Network → Jury + Filter → Secondary Structure • Figures shown: 61.7% / 65.9%, 62.6% / 67.4%, +4.3%, +3.4% (Jury + Filter), 70.8% final

JNet (Cuff & Barton, 2000) [diagram] • Inputs: profiles from PSIBLAST, HMMER2, CLUSTALW • Stages: Neural Network → Neural Network → Jury + Jury Network → Secondary Structure • Accuracy: 76.9%

PSIPRED (Jones, 1999) [diagram] • Input: PSI-BLAST profiles • Stages: Neural Network → Neural Network → Secondary Structure • Accuracy: 76.3%
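The systems above feed a profile built from homologous sequences, rather than the single sequence, into the predictor. A minimal sketch of a per-column frequency profile follows; it assumes the alignment is already available as equal-length strings with "-" for gaps, and it does not reproduce HSSP or PSI-BLAST output (which apply weighting and scoring not shown here). `frequency_profile` is a hypothetical name.

```python
from collections import Counter


def frequency_profile(alignment: list[str], gap: str = "-") -> list[dict[str, float]]:
    """Per-column amino-acid frequencies, ignoring gap characters."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("alignment must be non-empty rows of equal length")
    profile = []
    for column in zip(*alignment):
        counts = Counter(aa for aa in column if aa != gap)
        n = sum(counts.values())
        profile.append({aa: c / n for aa, c in counts.items()} if n else {})
    return profile


# Toy alignment: the query followed by two made-up homologues.
alignment = ["MRRW", "MKRW", "M-RF"]
for position, freqs in enumerate(frequency_profile(alignment)):
    print(position, freqs)
```

Each position is then described by a vector of frequencies instead of a single letter, so conserved and variable positions look different to the classifier even though the homologues' structures are unknown.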

Introduction to Decision Lists • Prototypical machine learning problem: • Decide democrat or republican for 435 representatives based on 16 votes.
Class Name: 2 (democrat, republican)
1. handicapped-infants: 2 (y,n)
2. water-project-cost-sharing: 2 (y,n)
3. adoption-of-the-budget-resolution: 2 (y,n)
4. physician-fee-freeze: 2 (y,n)
5. el-salvador-aid: 2 (y,n)
6. religious-groups-in-schools: 2 (y,n)
…
16. export-administration-act-south-africa: 2 (y,n)

Introduction to Decision Lists • Prototypical machine learning problem: • Decide democrat or republican for 435 representatives based on 16 votes.
1. If adoption-of-the-budget-resolution = y and anti-satellite-test-ban = n and water-project-cost-sharing = y then democrat
2. If physician-fee-freeze = y then republican
3. If TRUE then democrat
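A decision list is applied by returning the class of the first rule whose conditions all hold. The sketch below encodes the three rules from the slide; the representation (a list of condition/label pairs, with an empty condition set standing for TRUE) and the name `classify` are assumptions made for illustration, and the example voting records are made up.

```python
Rule = tuple[dict[str, str], str]  # (required attribute values, class)

# The three rules from the slide, in order. {} plays the role of "If TRUE".
RULES: list[Rule] = [
    (
        {
            "adoption-of-the-budget-resolution": "y",
            "anti-satellite-test-ban": "n",
            "water-project-cost-sharing": "y",
        },
        "democrat",
    ),
    ({"physician-fee-freeze": "y"}, "republican"),
    ({}, "democrat"),
]


def classify(example: dict[str, str], rules: list[Rule]) -> str:
    """Return the class of the first rule whose conditions all match."""
    for conditions, label in rules:
        if all(example.get(attr) == value for attr, value in conditions.items()):
            return label
    raise ValueError("no rule matched; a decision list needs a default rule")


# Made-up voting records.
print(classify({"physician-fee-freeze": "y"}, RULES))
print(classify({"physician-fee-freeze": "n"}, RULES))
```

Rule order matters: a record that satisfies rule 1 is labelled democrat even if it would also satisfy rule 2.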
