
Semi-supervised learning for protein classification

Brian R. King and Chittibabu Guda, Ph.D. Department of Computer Science, University at Albany, SUNY; Gen*NY*sis Center for Excellence in Cancer Genomics, University at Albany, SUNY.


Presentation Transcript


  1. Semi-supervised learning for protein classification
     Brian R. King and Chittibabu Guda, Ph.D.
     Department of Computer Science, University at Albany, SUNY
     Gen*NY*sis Center for Excellence in Cancer Genomics, University at Albany, SUNY

  2. The problem
     • Develop computational models of characteristics of protein structure and function from sequence alone, using machine-learned classifiers
       • Input: Data
       • Output: A model (function) h : X → Y
     • Traditional approach: supervised learning
     • Challenges:
       • Experimentally determined data: expensive, limited, subject to noise/error
       • Large repositories of unannotated data
       • Data representation, bias from unbalanced / underrepresented classes, etc.
     • Database sizes: TrEMBL 37.5: 5,035,267; Swiss-Prot 54.5: 289,473
     • AIM: Develop a method to use labeled and unlabeled data, while improving performance given the challenges presented by small, unbalanced data

  3. Solution
     • Semi-supervised learning
       • Use Dl and Du for model induction
     • Method: Generative, Bayesian probabilistic model
       • Based on ngLOC, a supervised Naïve Bayes classification method
       • Input / Feature Representation: Sequence n-gram model
       • Assumption: multinomial distribution
       • IID: sequences and n-grams
       • Use EXPECTATION MAXIMIZATION!
     • Test setup
       • Prediction of subcellular localization
       • Eukaryotic, non-plant sequences only
       • Dl : Data annotated with subcellular localization for eukaryotic, non-plant sequences
         • DL-2: EXT/PLA (~5500 sequences, balanced)
         • DL-3: GOL [65%] / LYS [14%] / POX [21%] (~600 sequences, unbalanced)
       • Du : Set from ~75K eukaryotic, non-plant protein sequences
     • Comparative method
       • Transductive SVM
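As a rough illustration of the n-gram, multinomial Naive Bayes idea on this slide, here is a minimal Python sketch. It is not the ngLOC implementation: the function names (`ngrams`, `train_nb`, `log_scores`, `predict`), the bigram default, the add-one smoothing, and the toy sequences are all assumptions made for the example.

```python
import math
from collections import Counter

def ngrams(seq, n=2):
    """Return the overlapping n-grams of a sequence string."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]

def train_nb(labeled, n=2, alpha=1.0):
    """Fit a multinomial Naive Bayes model from (sequence, label) pairs.

    alpha is an additive smoothing constant (an assumption of this sketch).
    """
    class_docs = Counter()   # number of sequences per class
    gram_counts = {}         # class -> Counter of n-gram counts
    vocab = set()
    for seq, label in labeled:
        class_docs[label] += 1
        counts = gram_counts.setdefault(label, Counter())
        for g in ngrams(seq, n):
            counts[g] += 1
            vocab.add(g)
    total_docs = sum(class_docs.values())
    return {
        "n": n,
        "alpha": alpha,
        "vocab": vocab,
        "counts": gram_counts,
        "totals": {c: sum(gram_counts[c].values()) for c in class_docs},
        "log_prior": {c: math.log(class_docs[c] / total_docs) for c in class_docs},
    }

def log_scores(model, seq):
    """Unnormalized log P(class) + sum of log P(n-gram | class)."""
    v = len(model["vocab"])
    scores = {}
    for c, log_prior in model["log_prior"].items():
        denom = model["totals"][c] + model["alpha"] * v
        score = log_prior
        for g in ngrams(seq, model["n"]):
            if g in model["vocab"]:  # n-grams never seen in training are skipped
                score += math.log((model["counts"][c][g] + model["alpha"]) / denom)
        scores[c] = score
    return scores

def predict(model, seq):
    """Return the class with the highest log score."""
    scores = log_scores(model, seq)
    return max(scores, key=scores.get)

# Toy data only; the class names are borrowed from the DL-2 labels on the slide.
toy = [("AAAAAA", "EXT"), ("AAAAGA", "EXT"), ("CCCCCC", "PLA"), ("CCCCTC", "PLA")]
model = train_nb(toy)
print(predict(model, "AAAA"), predict(model, "CCCC"))
```

The model treats each sequence as a bag of n-grams, which is where the multinomial and IID assumptions listed above enter.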

  4. Algorithms based on EM
     • EM-λ on DL-3 data
       • λ controls the effect of unlabeled (UL) data on parameter adjustments
       • ALL labeled data (~600)
       • Varied UL data
       • EM-λ outperforms TSVM on this problem
         • (Failed to converge on large amounts of UL data, despite parameter selection)
       • NOTE: TSVM performed very well on binary, balanced classification problems
     • Basic EM on DL-2
       • Varied labeled data
       • 25,000 UL sequences
       • Most improvement when labeled data is limited
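The slide does not give the update rule for EM-λ, so the following is only a sketch of one common formulation: unlabeled sequences contribute fractional (posterior-weighted) counts to the M-step, scaled by λ. With `lam=0` the unlabeled data is ignored, and with `lam=1` it counts as much as labeled data, which corresponds to basic EM. All names (`m_step`, `e_step`, `em_lambda`), the smoothing constant, and the toy data are assumptions.

```python
import math
from collections import defaultdict

def ngrams(seq, n=2):
    """Return the overlapping n-grams of a sequence string."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]

def m_step(weighted, classes, vocab, n, alpha=1.0):
    """Re-estimate parameters from (sequence, {class: weight}) pairs."""
    class_mass = {c: alpha for c in classes}
    gram_mass = {c: defaultdict(float) for c in classes}
    for seq, resp in weighted:
        grams = ngrams(seq, n)
        for c, w in resp.items():
            class_mass[c] += w
            for g in grams:
                gram_mass[c][g] += w
    total = sum(class_mass.values())
    log_prior = {c: math.log(class_mass[c] / total) for c in classes}
    log_like = {}
    for c in classes:
        denom = sum(gram_mass[c].values()) + alpha * len(vocab)
        log_like[c] = {g: math.log((gram_mass[c].get(g, 0.0) + alpha) / denom)
                       for g in vocab}
    return log_prior, log_like

def e_step(seq, log_prior, log_like, n):
    """Posterior P(class | sequence) under the current parameters."""
    scores = {c: log_prior[c] + sum(log_like[c][g] for g in ngrams(seq, n)
                                    if g in log_like[c])
              for c in log_prior}
    top = max(scores.values())  # subtract the max for numerical stability
    expd = {c: math.exp(s - top) for c, s in scores.items()}
    z = sum(expd.values())
    return {c: v / z for c, v in expd.items()}

def em_lambda(labeled, unlabeled, lam=0.1, n=2, iters=10):
    """Semi-supervised EM where unlabeled counts are down-weighted by lam."""
    classes = sorted({y for _, y in labeled})
    vocab = ({g for s, _ in labeled for g in ngrams(s, n)}
             | {g for s in unlabeled for g in ngrams(s, n)})
    hard = [(s, {y: 1.0}) for s, y in labeled]
    params = m_step(hard, classes, vocab, n)  # initialize from labeled data only
    for _ in range(iters):
        soft = []
        for s in unlabeled:
            post = e_step(s, *params, n)
            soft.append((s, {c: lam * p for c, p in post.items()}))
        params = m_step(hard + soft, classes, vocab, n)
    return params

# Toy data only; class names are borrowed from the DL-3 labels on the slide.
labeled = [("AAAAAA", "GOL"), ("CCCCCC", "LYS")]
unlabeled = ["AAAGAA", "CCCTCC"]
params = em_lambda(labeled, unlabeled, lam=0.5, iters=5)
print(e_step("AAGA", *params, 2))
```

A fixed iteration count is used here for brevity; a real run would more likely stop when the likelihood or the parameters stop changing.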

  5. Algorithm – EM-CS
     • Core ngLOC method outputs a confidence score (CS)
     • Improve running time through intelligent selection of unlabeled instances
       • CS(xi) > CSthresh? Use the instance
     • Test on DL-3 data:
       • First, determine the range of confidence scores through cross-validation without UL data: 33.5-47.8 (dependent on the level of similarity in the data and the size of the dataset)
       • Using only sequences that meet or exceed CSthresh significantly reduces the UL data required (97.5% eliminated)
       • NOTE: it is possible to reduce UL data too much
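The selection step can be sketched as a simple filter over the unlabeled pool. The slide does not say how ngLOC computes its confidence score or how CSthresh is chosen from the cross-validation range, so both are left as inputs here; `select_confident`, `score_fn`, and the toy scores are hypothetical. The sketch uses "meets or exceeds" (>=) for the comparison.

```python
def select_confident(unlabeled, score_fn, cs_thresh):
    """Keep unlabeled instances whose confidence score meets or exceeds cs_thresh.

    score_fn is any callable mapping an instance to a numeric confidence score.
    Returns the kept instances and the fraction of the pool that was dropped.
    """
    kept = [x for x in unlabeled if score_fn(x) >= cs_thresh]
    dropped = 1.0 - len(kept) / len(unlabeled) if unlabeled else 0.0
    return kept, dropped

# Hypothetical confidence scores for four unlabeled sequences.
toy_scores = {"seq1": 12.0, "seq2": 41.3, "seq3": 33.4, "seq4": 47.9}
kept, dropped = select_confident(list(toy_scores), toy_scores.get, cs_thresh=40.0)
print(kept, dropped)
```

Reporting the dropped fraction makes it easy to notice when a threshold removes too much of the unlabeled pool, which the slide warns is possible.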

  6. Conclusion
     • Benefits:
       • Probabilistic
         • Extract unlabeled sequences of “high-confidence”
         • Difficult with SVM or TSVM
       • Extraction of knowledge from model
         • Discriminative n-grams and anomalies
         • Information theoretic measures, KL-divergence, etc.
         • Again, difficult with SVM or TSVM
       • Computational resources
         • Time: Significantly lower than SVM and TSVM
         • Space: Dependent on n-gram model
       • Can use large amounts of unlabeled data
       • Applicable toward prediction of any structural or functional characteristic
       • Outputs a global model
         • Transduction is not global!
       • Most substantial gain with limited labeled data
     • Current work in progress:
       • TSVMs
       • Improve performance on smaller, unbalanced data
       • Select an improved smaller dimensional feature space representation
       • Ensemble classifiers, Bayesian model averaging, Mixture of experts
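One way to read "discriminative n-grams" through KL-divergence is to rank n-grams by their individual terms in KL(p || q), where p and q are the n-gram distributions of two classes. The slides do not specify how the authors do this, so the sketch below is an assumption; the function names and the toy distributions are hypothetical, and q is assumed to be smoothed so that it is never zero.

```python
import math

def kl_contributions(p, q):
    """Per-n-gram terms p(g) * log2(p(g) / q(g)); their sum is KL(p || q) in bits.

    Assumes q[g] > 0 for every n-gram g with p[g] > 0 (e.g. smoothed estimates).
    """
    return {g: p[g] * math.log2(p[g] / q[g]) for g in p if p[g] > 0}

def top_discriminative(p, q, k=3):
    """Return the k n-grams with the largest contribution to KL(p || q)."""
    terms = kl_contributions(p, q)
    return sorted(terms, key=terms.get, reverse=True)[:k]

# Toy class-conditional n-gram distributions (each sums to 1).
p = {"AA": 0.7, "AG": 0.2, "CC": 0.1}
q = {"AA": 0.1, "AG": 0.2, "CC": 0.7}
print(top_discriminative(p, q, k=1), sum(kl_contributions(p, q).values()))
```

Because the generative model stores these distributions explicitly, this kind of inspection needs no extra training.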
