Selecting Biomolecules for Study: Active Learning Algorithm in Protein Structure Prediction

An algorithm to guide selection of specific biomolecules to be studied by wet-lab experiments
Jessica Wehner and Madhavi Ganapathiraju
Department of Biomedical Informatics, University of Pittsburgh School of Medicine, Pittsburgh, PA, USA
Presented by Thahir P. Mohamed
Advancing Practice, Instruction & Innovation through Informatics, October 19-23, 2008

Protein Structure
Primary Structure: Chain of amino acids
Secondary Structure: Sub-structures such as helices and strands
Tertiary Structure: Atomic resolution of protein structure
Protein structure is essential for successful design of drugs

Challenges in Protein Structure Prediction
• X-ray crystallography and NMR spectroscopy are wet-lab methods to determine structure.
• Very expensive
• Very time consuming
• Computational techniques are applied to predict protein structure

Computational Protein Structure Prediction
• Machine learning techniques are applied to predict structure
• Experimentally determined structures are used to learn to predict new structures
• When there is not enough data to learn from:
  • Active learning is applied to select the next protein to be studied experimentally

Active Learning
[Figure: diagram showing Unlabeled Proteins and the Possible Labels.]

Active Learning
[Figure: diagram with the elements Unlabeled Proteins, Cluster, Clustered Proteins and Possible Labels.]

Active Learning
[Figure: diagram with the elements Unlabeled Proteins, Cluster, Clustered Proteins, Selection Algorithm and Possible Labels.]

Active Learning
[Figure: diagram with the elements Unlabeled Proteins, Cluster, Selection Algorithm, Labeled Proteins, Prediction and Possible Labels.]
Active learning guides the selection of data points for which labels are requested.

Membrane Protein Structure Prediction
Membrane protein importance and challenges
Membrane Proteins:
• 30% of genes
• Cell regulation and signaling pathways
• 60% of drug targets
Yet,
• Difficult to study experimentally
• 1% of known protein structures
Active learning can help compensate for the limited number of known membrane protein (MP) structures, given the large number of known MP sequences.

‘Features’ Representation
Position: 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7
Residue:  A L H W R A A G A A T V L L V I V E R G A P G A Q L I
Topology: - - - - - M M M M M M M M M M M M - - - - - - - - - -
Charge:   - - p - p - - - - - - - - - - - - n p - - - - - - - -
E-Prop:   D d . . A D D . D D a d d d d d d D A . D D . D a d d
Properties: Charge, Size, Polarity, Aromaticity, Electronic Properties
Data reduction is performed by SVD, resulting in a final 4 features per window.
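The windowing and SVD step might look roughly like the sketch below. The slide does not state the window width, the feature encoding, or whether the data were centered, so the one-hot charge encoding, the width of 15, and the helper names `window_features` and `svd_reduce` are all assumptions made for illustration. Only the charge row and the target of 4 features per window come from the slide.

```python
import numpy as np

def window_features(per_residue, width):
    """Flatten each run of `width` consecutive residue rows into one vector."""
    n = per_residue.shape[0]
    return np.array([per_residue[i:i + width].ravel()
                     for i in range(n - width + 1)])

def svd_reduce(X, k=4):
    """Project the rows of X onto its top-k right singular vectors."""
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:k].T

# Charge row from the slide: '-' neutral, 'p' positive, 'n' negative.
charge = "--p-p------------np--------"
codes = {"-": 0, "p": 1, "n": 2}
one_hot = np.eye(3)[[codes[c] for c in charge]]   # shape (27, 3), assumed encoding

windows = window_features(one_hot, width=15)      # width 15 is an assumption
reduced = svd_reduce(windows, k=4)                # 4 features per window
print(windows.shape, reduced.shape)               # (13, 45) (13, 4)
```

In a fuller version, the other property rows (size, polarity, aromaticity, electronic properties) would be encoded and concatenated before windowing.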

Clustering the Data
• Neural Network Self Organizing Map (SOM)
• Finds centroids of clusters in the data
[Figure: plot of the data with axes Dim 1, Dim 2 and Dim 3.]
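A self-organizing map can be sketched in a few lines of NumPy. This is a generic textbook-style SOM, not the implementation used in the talk. The grid size, learning-rate schedule, neighborhood width, and the names `train_som` and `assign_nodes` are assumptions, and the input is synthetic toy data.

```python
import numpy as np

def train_som(data, grid=(3, 3), epochs=20, lr0=0.5, sigma0=1.0, seed=0):
    """Return one weight vector (centroid) per node of a rectangular SOM grid."""
    rng = np.random.default_rng(seed)
    rows, cols = grid
    coords = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    # Initialise node weights from randomly chosen data points.
    weights = data[rng.integers(0, len(data), rows * cols)].astype(float)
    n_steps, step = epochs * len(data), 0
    for _ in range(epochs):
        for x in data[rng.permutation(len(data))]:
            frac = step / n_steps
            lr = lr0 * (1.0 - frac)                  # decaying learning rate
            sigma = sigma0 * (1.0 - frac) + 1e-3     # shrinking neighborhood
            bmu = np.argmin(((weights - x) ** 2).sum(axis=1))   # best-matching unit
            grid_d2 = ((coords - coords[bmu]) ** 2).sum(axis=1)
            h = np.exp(-grid_d2 / (2.0 * sigma ** 2))
            weights += lr * h[:, None] * (x - weights)          # pull neighbors toward x
            step += 1
    return weights

def assign_nodes(data, weights):
    """Index of the nearest node for every data point."""
    d2 = ((data[:, None, :] - weights[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

# Toy data: two well-separated blobs in 4 dimensions (4 features per window).
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(0.0, 0.1, (20, 4)), rng.normal(1.0, 0.1, (20, 4))])
weights = train_som(data, grid=(2, 2))
nodes = assign_nodes(data, weights)
```

Each row of `weights` plays the role of a cluster centroid, and `nodes` gives the node each window falls into.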

Design 1: Density-based Selection
• Find the densest cluster
• Choose N points closest to its centroid
• Find labels for these points (TM or NTM)
• Find the majority label, say L
• Assign L to all points in the cluster
• Repeat for the next densest cluster
Clusters with no known structures are marked for study by experiments
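The steps above can be sketched as follows. The slide does not define density, so this sketch assumes it means the number of points assigned to a cluster. The function name `density_based_labeling`, the `oracle` callback (returning "TM", "NTM", or `None` when no structure is known), and the toy data are hypothetical.

```python
from collections import Counter
import numpy as np

def density_based_labeling(data, node_of, centroids, oracle, n_queries=10):
    """Label whole clusters from the majority label of points near each centroid."""
    predicted, needs_experiment = {}, []
    sizes = Counter(node_of.tolist())
    for node, _ in sizes.most_common():              # densest cluster first (assumed: largest)
        members = np.flatnonzero(node_of == node)
        dist = np.linalg.norm(data[members] - centroids[node], axis=1)
        nearest = members[np.argsort(dist)[:n_queries]]
        labels = [oracle(int(i)) for i in nearest]
        labels = [lab for lab in labels if lab is not None]
        if not labels:                               # no known structure in this cluster
            needs_experiment.append(node)
            continue
        majority = Counter(labels).most_common(1)[0][0]
        for i in members:
            predicted[int(i)] = majority             # assign L to all points in the cluster
    return predicted, needs_experiment

# Toy example: cluster 0 has known labels, cluster 1 has none.
data = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0]])
node_of = np.array([0, 0, 0, 1, 1])
centroids = np.array([[0.03, 0.03], [5.0, 5.0]])
known = {0: "TM", 1: "TM", 2: "NTM"}
predicted, needs_experiment = density_based_labeling(
    data, node_of, centroids, oracle=known.get, n_queries=3)
```

Here cluster 0 receives the majority label of its three queried points, while cluster 1 is returned in `needs_experiment` because none of its points has a known label.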

Design 1 Results
Increase the number of data points for which a structure label is requested, and compare how accuracy varies between guided selection (via active learning) and random selection.
A total of only 10 labels per node, about 1% of the data.
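A comparison of this kind could be set up along the following lines. This is only an illustrative sketch. The helper names `guided_queries`, `random_queries` and `accuracy_after_propagation`, the accuracy definition (fraction of points in labeled nodes whose true label matches the node's majority label), and the toy data are assumptions, and no result from the talk is reproduced here.

```python
from collections import Counter
import numpy as np

def guided_queries(data, node_of, centroids, n):
    """Per node, pick the n members closest to that node's centroid."""
    picked = []
    for node in np.unique(node_of):
        members = np.flatnonzero(node_of == node)
        dist = np.linalg.norm(data[members] - centroids[node], axis=1)
        picked.extend(members[np.argsort(dist)[:n]].tolist())
    return picked

def random_queries(node_of, n, rng):
    """Per node, pick up to n members uniformly at random."""
    picked = []
    for node in np.unique(node_of):
        members = np.flatnonzero(node_of == node)
        size = min(n, len(members))
        picked.extend(rng.choice(members, size=size, replace=False).tolist())
    return picked

def accuracy_after_propagation(true_labels, node_of, query_idx):
    """Give each node the majority label of its queried points, then score it."""
    true_labels = np.asarray(true_labels)
    correct = total = 0
    for node in np.unique(node_of):
        members = np.flatnonzero(node_of == node)
        asked = [i for i in query_idx if node_of[i] == node]
        if not asked:
            continue
        majority = Counter(true_labels[asked].tolist()).most_common(1)[0][0]
        correct += int((true_labels[members] == majority).sum())
        total += len(members)
    return correct / total if total else float("nan")

# Toy data with hypothetical true labels.
data = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0]])
node_of = np.array([0, 0, 0, 1, 1])
centroids = np.array([[0.03, 0.03], [5.0, 5.0]])
truth = ["TM", "TM", "NTM", "NTM", "NTM"]

guided = guided_queries(data, node_of, centroids, n=1)
rand = random_queries(node_of, n=1, rng=np.random.default_rng(0))
acc_guided = accuracy_after_propagation(truth, node_of, guided)
acc_random = accuracy_after_propagation(truth, node_of, rand)
```

Sweeping `n` upward and averaging `acc_random` over many seeds would give the two curves the slide describes.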

Design 2: Protein-based Selection
• Pick a random protein
• Find labels for all windows in this protein
• For each node containing labels, find the mode L of all labels it contains
• Assign L to the remaining data in the node
• Repeat and update for each new protein, until half of the proteins have been selected
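A minimal sketch of the loop above, assuming each window is described by the protein it came from and the SOM node it was assigned to. The function name `protein_based_labeling`, the `oracle` callback that returns the true label of a window, and the toy inputs are hypothetical.

```python
import random
from collections import Counter, defaultdict

def protein_based_labeling(protein_of, node_of, oracle, seed=0):
    """Reveal labels protein by protein and propagate each node's mode label."""
    rng = random.Random(seed)
    proteins = sorted(set(protein_of))
    rng.shuffle(proteins)                            # random selection order
    known, predicted = {}, {}
    votes = defaultdict(Counter)                     # node -> label counts
    for protein in proteins[: len(proteins) // 2]:   # stop once half are selected
        for i, p in enumerate(protein_of):
            if p == protein:                         # label every window of this protein
                known[i] = oracle(i)
                votes[node_of[i]][known[i]] += 1
        # Update: mode label per node, assigned to the node's remaining windows.
        node_label = {n: c.most_common(1)[0][0] for n, c in votes.items()}
        predicted = {i: node_label[n] for i, n in enumerate(node_of)
                     if i not in known and n in node_label}
    return known, predicted

# Toy example: four proteins with two windows each, falling into two nodes.
protein_of = ["A", "A", "B", "B", "C", "C", "D", "D"]
node_of = [0, 1, 0, 1, 0, 1, 0, 1]
truth = ["TM", "NTM"] * 4
known, predicted = protein_based_labeling(protein_of, node_of, truth.__getitem__)
```

Running this with different `seed` values changes the protein selection order.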

Protein-based Results
Repeated for different permutations of the protein selection order, with several metrics observed.
[Figure: results plot with an axis labeled "Percent".]

Conclusions
• We developed a framework for selecting a few proteins, or fragments of proteins, which, once annotated by experimental methods, may be used to label the remaining protein sequences.
• We have shown that guided selection of data can achieve higher accuracy than random selection of data.

Acknowledgements
Madhavi Ganapathiraju, Jessica Wehner
JW funded through NIH-NSF Bioengineering & Bioinformatics Summer Institute
Visit us at Department of Biomedical Informatics, University of Pittsburgh
Thank you!
www.dbmi.pitt.edu/madhavi
[Photo: Cathedral of Learning, University of Pittsburgh]
