This research explores an algorithm for selecting specific biomolecules for wet-lab experiments based on active learning in protein structure prediction, addressing challenges in computational techniques and experimental methods. The study focuses on membrane proteins to improve drug design and understand cell regulation pathways. The algorithm uses labeled and unlabeled proteins to guide data selection, enhancing accuracy compared to random selection. Findings indicate the potential for higher accuracy with guided selection methods in protein research.
An algorithm to guide selection of specific biomolecules to be studied by wet-lab experiments Jessica Wehner and Madhavi Ganapathiraju Department of Biomedical Informatics University of Pittsburgh School of Medicine Pittsburgh PA USA Presented by Thahir P. Mohamed Advancing Practice, Instruction & Innovation through Informatics October 19-23, 2008
Protein Structure
• Primary structure: chain of amino acids
• Secondary structure: sub-structures such as helices and strands
• Tertiary structure: atomic-resolution structure of the protein
Protein structure is essential for the successful design of drugs.
Challenges in Protein Structure Prediction
• X-ray crystallography and NMR spectroscopy are wet-lab methods for determining structure.
• Both are very expensive and very time-consuming.
• Computational techniques are therefore applied to predict protein structure.
Computational Protein Structure Prediction
• Machine learning techniques are applied to predict structure.
• Experimentally determined structures are used to learn to predict new structures.
• When there is not enough data to learn from, active learning is applied to select the next protein to be studied experimentally.
Active Learning
[Figure sequence: unlabeled proteins, each with a set of possible labels, are grouped into clusters; a selection algorithm then operates on the clustered proteins.]
Active Learning
[Figure: unlabeled proteins, cluster, selection algorithm, labeled proteins, prediction]
Active learning guides the selection of data points for which labels are requested.
Membrane Protein Structure Prediction: Membrane Protein Importance and Challenges
Membrane proteins (MPs):
• 30% of genes
• Cell regulation and signaling pathways
• 60% of drug targets
Yet:
• Difficult to study experimentally
• 1% of known protein structures
Active learning can be used as a tool to offset the limited number of known MP structures, given the large number of known MP sequences.
‘Features’ Representation
          1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7
Residue:  A L H W R A A G A A T V L L V I V E R G A P G A Q L I
Topology: - - - - - M M M M M M M M M M M M - - - - - - - - - -
Charge:   - - p - p - - - - - - - - - - - - n p - - - - - - - -
E-Prop:   D d . . A D D . D D a d d d d d d D A . D D . D a d d
Properties: charge, size, polarity, aromaticity, electronic properties
Data reduction is performed by singular value decomposition (SVD), resulting in a final 4 features per window.
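The charge row above can be reproduced with a small sketch. This is only an illustration under stated assumptions: the slide shows H and R marked `p` and E marked `n`; treating K as positive and D as negative is an addition based on standard residue chemistry, and the function names and the window size of 5 are hypothetical. The SVD reduction step is not shown.

```python
# Hypothetical encoding of the 'Charge' property row; not the authors' code.
POSITIVE = set("HRK")  # H and R are 'p' on the slide; K is an added assumption
NEGATIVE = set("DE")   # E is 'n' on the slide; D is an added assumption


def charge_track(sequence):
    """Return one symbol per residue: 'p' positive, 'n' negative, '-' neutral."""
    return "".join(
        "p" if aa in POSITIVE else "n" if aa in NEGATIVE else "-"
        for aa in sequence
    )


def sliding_windows(track, size):
    """Return every contiguous window of `size` symbols (size is an assumed parameter)."""
    return [track[i:i + size] for i in range(len(track) - size + 1)]


seq = "ALHWRAAGAATVLLVIVERGAPGAQLI"   # residue row from the slide
charge = charge_track(seq)
charge_windows = sliding_windows(charge, 5)
print(charge)
print(charge_windows[:3])
```

Each of the other property rows could be encoded the same way with its own residue-to-symbol mapping before the windows are reduced to 4 features.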
Clustering the Data
• Neural network Self-Organizing Map (SOM)
• Finds the centroids of clusters in the data
[Figure: data plotted along Dim 1, Dim 2, and Dim 3]
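As a rough illustration of how a SOM finds cluster centroids, here is a minimal pure-Python sketch. The slides do not give the map topology or training settings, so the 1-D chain of nodes, the Gaussian neighborhood, the linear decay schedules, and all names and parameter values below are assumptions rather than the authors' configuration.

```python
import math
import random


def train_som(data, n_nodes, epochs=50, lr0=0.5, seed=0):
    """Train a 1-D SOM (assumed topology) and return one weight vector per node."""
    rng = random.Random(seed)
    # Initialize node weights from distinct data points.
    weights = [list(x) for x in rng.sample(data, n_nodes)]
    radius0 = max(n_nodes / 2, 1.0)
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1 - frac)                    # learning rate decays linearly
        radius = max(radius0 * (1 - frac), 0.5)  # neighborhood shrinks, floor at 0.5
        for x in rng.sample(data, len(data)):    # visit points in random order
            # Best-matching unit: the node whose weights are closest to x.
            bmu = min(range(n_nodes), key=lambda j: math.dist(weights[j], x))
            for j in range(n_nodes):
                # Gaussian neighborhood on the node index.
                h = math.exp(-((j - bmu) ** 2) / (2 * radius ** 2))
                weights[j] = [w + lr * h * (xi - w) for w, xi in zip(weights[j], x)]
    return weights


def assign_nodes(data, weights):
    """Map each data point to the index of its nearest node (its cluster)."""
    return [min(range(len(weights)), key=lambda j: math.dist(weights[j], x))
            for x in data]


# Two well-separated toy groups of 2-D points (made-up data).
data = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (-0.1, 0.2),
        (10.0, 10.0), (10.2, 9.9), (9.8, 10.1), (10.1, 10.3)]
weights = train_son = train_som(data, n_nodes=2)
nodes = assign_nodes(data, weights)
print(nodes)
```

The trained weight vectors play the role of cluster centroids, and `nodes` gives the node that each window would belong to in the selection designs.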
Design 1: Density-based Selection
• Find the densest cluster
• Choose the N points closest to its centroid
• Find labels for these points (transmembrane, TM, or non-transmembrane, NTM)
• Find the majority label, say L
• Assign L to all points in the cluster
• Repeat for the next-densest cluster
Clusters with no known structures are marked for study by experiments.
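The steps of Design 1 can be sketched as follows. This is a minimal illustration, not the authors' implementation: density is approximated here by the number of points in a cluster, and the data layout, the `oracle` callable (which returns `None` when no structure is known), and all names are hypothetical.

```python
import math
from collections import Counter


def density_based_labeling(clusters, oracle, n_queries):
    """
    clusters: dict cluster_id -> {"centroid": tuple, "points": {point_id: tuple}}
    oracle:   callable point_id -> "TM", "NTM", or None if no structure is known
    Returns (label assigned to each point_id, cluster ids marked for experiments).
    """
    assigned, needs_experiment = {}, []
    # Assumption: "density" is approximated by the cluster's point count.
    order = sorted(clusters, key=lambda c: len(clusters[c]["points"]), reverse=True)
    for cid in order:
        centroid = clusters[cid]["centroid"]
        points = clusters[cid]["points"]
        # Query the n_queries points closest to the centroid.
        nearest = sorted(points, key=lambda p: math.dist(points[p], centroid))[:n_queries]
        known = [lab for lab in (oracle(p) for p in nearest) if lab is not None]
        if not known:
            # No known structures here: mark the cluster for wet-lab study.
            needs_experiment.append(cid)
            continue
        majority = Counter(known).most_common(1)[0][0]
        for p in points:
            assigned[p] = majority  # propagate the majority label to the whole cluster
    return assigned, needs_experiment


# Toy example with made-up points and labels.
clusters = {
    "a": {"centroid": (0.0, 0.0),
          "points": {1: (0.1, 0.0), 2: (0.0, 0.2), 3: (1.0, 1.0), 4: (0.05, 0.05)}},
    "b": {"centroid": (5.0, 5.0),
          "points": {5: (5.1, 5.0), 6: (4.9, 5.2)}},
}
known_labels = {1: "TM", 2: "NTM", 3: "NTM", 4: "TM"}
labels, todo = density_based_labeling(clusters, known_labels.get, n_queries=3)
print(labels, todo)
```

In the toy run, cluster "a" is queried at points 4, 1, and 2, receives the majority label TM, and cluster "b" is returned in `todo` because none of its points has a known label.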
Design 1 Results
• Increase the number of data points for which structure labels are requested.
• Compare how accuracy varies between guided selection (via active learning) and random selection.
• A total of only 10 labels per node is requested, about 1% of the data.
Design 2: Protein-based Selection
• Pick a random protein
• Find labels for all windows in this protein
• For each node containing labels, find the mode L of all labels it contains
• Assign L to the remaining data in the node
• Repeat and update for each new protein, until half of the proteins have been selected
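The steps of Design 2 can be sketched in the same spirit. Again this is an illustration under assumptions: the tuple layout for windows, the `true_label` lookup standing in for experimental annotation, the seeded shuffle, and all names are hypothetical and not taken from the slides.

```python
import random
from collections import Counter, defaultdict


def protein_based_labeling(windows, true_label, seed=0):
    """
    windows:    list of (window_id, protein_id, node_id)
    true_label: dict window_id -> label, consulted only for selected proteins
    Returns predicted labels for every window whose node holds at least one label.
    """
    proteins = sorted({p for _, p, _ in windows})
    random.Random(seed).shuffle(proteins)        # random protein selection order
    selected = proteins[: len(proteins) // 2]    # stop once half have been selected

    revealed, predicted = {}, {}
    for protein in selected:
        # Obtain labels for all windows of the newly selected protein.
        for w, p, _ in windows:
            if p == protein:
                revealed[w] = true_label[w]
        # For each node containing labels, find the mode of those labels.
        per_node = defaultdict(list)
        for w, _, node in windows:
            if w in revealed:
                per_node[node].append(revealed[w])
        node_mode = {n: Counter(labs).most_common(1)[0][0]
                     for n, labs in per_node.items()}
        # Assign the mode to the remaining windows in each labeled node.
        predicted = {w: revealed.get(w, node_mode[node])
                     for w, _, node in windows if node in node_mode}
    return predicted


# Toy example: two proteins whose windows fall into the same two nodes.
windows = [("w1", "P1", 0), ("w2", "P1", 1), ("w3", "P2", 0), ("w4", "P2", 1)]
true_label = {"w1": "TM", "w2": "NTM", "w3": "TM", "w4": "NTM"}
predicted = protein_based_labeling(windows, true_label)
print(predicted)
```

Running the function with different `seed` values corresponds to trying different permutations of the protein selection order.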
Protein-based Results
The procedure was repeated for different permutations of the protein selection order, and several metrics were observed.
[Figure: results plot; axis label "Percent"]
Conclusions
• We developed a framework that allows us to select a few proteins or fragments of proteins which, when annotated with experimental methods, may be used to label the remaining protein sequences.
• We have shown that it is possible to achieve higher accuracy with guided selection of data than with random selection of data.
Acknowledgements
Madhavi Ganapathiraju, Jessica Wehner
JW was funded through the NIH-NSF Bioengineering & Bioinformatics Summer Institute.
Visit us at the Department of Biomedical Informatics, University of Pittsburgh: www.dbmi.pitt.edu/madhavi
Thank you!
[Image: Cathedral of Learning, University of Pittsburgh]