
Application of Stacked Generalization to a Protein Localization Prediction Task


Presentation Transcript


  1. Application of Stacked Generalization to a Protein Localization Prediction Task Melissa K. Carroll, M.S. and Sung-Hyuk Cha, Ph.D. Pace University, School of Computer Science and Information Systems September 27, 2003

  2. Overview • Introduction • Purpose • Methods • Algorithms • Results • Conclusions and Future Work

  3. Introduction

  4. Introduction: Data Mining • Application of machine learning algorithms to large databases • Often used to classify future data based on a training set • “Target” variable is the variable to be predicted • Theoretically, algorithms are context-independent

  5. Introduction: Stacked Generalization • Method for combining models • Part of the training set is used to train level-0, or base, models as usual • Level-1 data is built from predictions of the level-0 models on the remainder of the set • Level-1 generalizers are models trained on the level-1 data
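The scheme on this slide can be summarized in a short sketch. The data, the model choices, and the scikit-learn calls below are illustrative placeholders, not the configuration used in the project:

```python
# Minimal stacked-generalization sketch with placeholder data and models.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X = rng.random((300, 10))            # placeholder attributes
y = rng.integers(0, 3, 300)          # placeholder 3-class target

# Part of the training set is used to train the level-0 (base) models as usual.
X_l0, X_val, y_l0, y_val = train_test_split(X, y, test_size=0.3, random_state=0)
level0 = [
    DecisionTreeClassifier(random_state=0).fit(X_l0, y_l0),
    MLPClassifier(hidden_layer_sizes=(1,), max_iter=2000, random_state=0).fit(X_l0, y_l0),
]

# Level-1 data is built from the level-0 predictions on the remainder of the set.
Z_val = np.column_stack([m.predict(X_val) for m in level0])

# A level-1 generalizer is a model trained on that level-1 data.
level1 = GaussianNB().fit(Z_val, y_val)

# To classify a new instance, run it through level 0 and then level 1.
x_new = rng.random((1, 10))
z_new = np.column_stack([m.predict(x_new) for m in level0])
print(level1.predict(z_new))
```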

  6. Introduction: Bioinformatics and Protein Localization • Bioinformatics: application of computing to molecular biology • Currently much interest in information about proteins • Localization: expression of a protein in a particular type or part of cell • Knowledge of protein localization can shed light on a protein’s function • Data mining employed to predict localization from a database of information about the encoding genes

  7. Introduction: KDD Cup 2001 Task • KDD Cup: annual data mining competition sponsored by ACM SIGKDD • Participants use a training set to predict target variable values in a test dataset of different instances • Winner is the most accurate model (correct predictions / total instances in test set) • 2001 task: predict protein localization of genes • Instances were anonymized genes; attributes were information about the genes • Datasets (incl. revealed target) used in this project

  8. Purpose • Use Stacked Generalization approach on this task • Compare inter-algorithm performance using level-0 models and level-1 generalizers • Evaluate strategy of equally distributing target variable

  9. Methods

  10. Methods: Dataset Manipulations • Reduce number of input variables • Reduce number of potential target values to 3 • Separate original training dataset into training and validation sets for stacking • Eliminate effectively unary variables in final training dataset
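Redone today, these manipulations might look roughly like the pandas sketch below. The file name, the column name localization, the 0.3 validation fraction, and the 99% threshold for "effectively unary" are all assumptions; the slides do not specify them, nor exactly how the target was reduced to 3 values (here the three most frequent localizations are kept):

```python
# Hedged sketch of the dataset manipulations listed above (names are placeholders).
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("genes_train.csv")                      # placeholder file name

# Reduce the target to 3 values (assumption: keep the 3 most frequent localizations).
top3 = df["localization"].value_counts().nlargest(3).index
df = df[df["localization"].isin(top3)]

# Separate the original training data into training and validation sets for stacking.
train_df, valid_df = train_test_split(df, test_size=0.3, random_state=0)

# Eliminate effectively unary variables (one value accounts for nearly all rows).
unary = [c for c in train_df.columns
         if train_df[c].value_counts(normalize=True).iloc[0] > 0.99]
train_df = train_df.drop(columns=unary)
valid_df = valid_df.drop(columns=unary)
```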

  11. Table: Target Variable Distribution

  12. Methods: Equally Distributed Approach • Second training set created by stratifying to ensure equally distributed localizations • Level-0 models trained on both the raw (unequally distributed) and the equally distributed training sets • Separate level-1 data and level-1 generalizers derived from this dataset
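A hedged sketch of the stratification step, continuing from the train_df built above: each localization is downsampled to the size of the rarest class so the second training set has an equal target distribution. Whether the original work sampled with or without replacement is not stated; sampling without replacement is assumed here:

```python
# Build the "equally distributed" training set by downsampling each class.
def equalize(df, target="localization", random_state=0):
    n = df[target].value_counts().min()          # size of the rarest localization
    return df.groupby(target).sample(n=n, random_state=random_state)

equal_train_df = equalize(train_df)
print(equal_train_df["localization"].value_counts())   # all classes now equal
```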

  13. Algorithms

  14. Algorithms: Level-0 Artificial Neural Network (ANN) • Fully connected feedforward network • Input variables → dummy variables → 186 input nodes • Target variable → dummy variables → 2 output nodes • 1 hidden node • Training based on change in misclassification rate
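A rough present-day analogue of this network, reusing train_df from above: categorical inputs are expanded into dummy variables and fed to a feedforward network with a single hidden node. The slide's 2-dummy output coding and the misclassification-based stopping rule are not reproduced; scikit-learn's standard softmax output and loss-based stopping are used instead, so this is only an approximation:

```python
import pandas as pd
from sklearn.neural_network import MLPClassifier

# Expand categorical input variables into dummy (one-hot) columns.
X_dummies = pd.get_dummies(train_df.drop(columns=["localization"]))

# Fully connected feedforward network with one hidden node.
ann = MLPClassifier(hidden_layer_sizes=(1,), max_iter=5000, random_state=0)
ann.fit(X_dummies, train_df["localization"])
print(ann.score(X_dummies, train_df["localization"]))   # training accuracy
```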

  15. Algorithms: Level-0 Decision Tree • Used CHAID-like algorithm • Chi-squared p value splitting criterion: p < 0.2 • Model selection based on proportion of instances correctly classified
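The chi-squared criterion on this slide can be illustrated in isolation. The sketch below, again using train_df, scores each categorical attribute with a chi-squared test of independence against the target and keeps those with p < 0.2 as candidate splits; it is a single-split illustration, not the full CHAID-like tree:

```python
# Score candidate split attributes with a chi-squared test (p < 0.2 threshold from the slide).
import pandas as pd
from scipy.stats import chi2_contingency

def candidate_splits(df, target="localization", alpha=0.2):
    splits = []
    for col in df.columns.drop(target):
        table = pd.crosstab(df[col], df[target])       # attribute-vs-target contingency table
        _, p, _, _ = chi2_contingency(table)
        if p < alpha:
            splits.append((p, col))
    return sorted(splits)                              # lowest p (best split) first

print(candidate_splits(train_df)[:5])
```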

  16. Algorithms: Level-0 Nearest Neighbor (NN) • Compare each instance to be classified against every instance in the training dataset • Count number of matching attributes • Predict target value of the training instance matching on the greatest number of attributes • Use relative frequency in the unequally distributed dataset to break ties
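One possible reading of this matching-attribute scheme, with train_df and valid_df from the earlier sketches and the column name localization assumed:

```python
# Predict the localization of the training instance that matches the query on the
# most attributes; break ties with relative frequency in the raw (unequally
# distributed) training set. This is one reading of the tie-break rule on the slide.
def nn_predict(query, train_df, target="localization"):
    features = train_df.columns.drop(target)
    matches = (train_df[features] == query[features]).sum(axis=1)   # matching-attribute counts
    best = train_df.loc[matches == matches.max(), target]           # tied nearest neighbours
    priors = train_df[target].value_counts()
    return max(best.unique(), key=lambda v: priors[v])

print(nn_predict(valid_df.iloc[0], train_df))
```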

  17. Algorithms: Level-0 Hybrid Decision Tree/ANN • Difficult for ANN to learn with too many variables • Decision Tree can be used as a “feature selector” • Important variables are those used as branching criteria • New ANN trained using only important variables as inputs
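A sketch of the hybrid idea using the objects built above: the attributes the fitted tree actually branches on (approximated here by non-zero feature importance) become the only inputs to a new ANN:

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier

# Fit a tree to act as the "feature selector".
tree = DecisionTreeClassifier(random_state=0).fit(X_dummies, train_df["localization"])

# Keep only the dummy columns the tree used as branching criteria.
important = X_dummies.columns[tree.feature_importances_ > 0]

# Retrain the ANN on the reduced input set.
hybrid_ann = MLPClassifier(hidden_layer_sizes=(1,), max_iter=5000, random_state=0)
hybrid_ann.fit(X_dummies[important], train_df["localization"])
```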

  18. Algorithms: Level-1 Generalizers • ANN and Decision Tree: designed and trained essentially the same as their level-0 counterparts; the ANN had 8 input nodes • Naïve Bayesian model: calculated the likelihood of each target value based on Bayes rule and predicted the value with the highest likelihood
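The naive Bayesian generalizer amounts to scoring each candidate localization with Bayes' rule over the level-0 predictions. A hand-rolled sketch, reusing Z_val, y_val, and z_new from the first stacking sketch; the Laplace smoothing constant is an assumption, not stated on the slide:

```python
# For each candidate class c, score P(c) * prod_j P(prediction_j | c) over the
# level-0 models' predictions and return the class with the highest score.
import numpy as np

def naive_bayes_predict(z_row, Z, y, alpha=1.0):
    scores = {}
    for c in np.unique(y):
        Zc = Z[y == c]
        score = np.log(len(Zc) / len(y))                 # log prior P(c)
        for j, v in enumerate(z_row):                    # log likelihoods P(z_j | c), smoothed
            count = np.sum(Zc[:, j] == v)
            score += np.log((count + alpha) / (len(Zc) + alpha * len(np.unique(Z[:, j]))))
        scores[c] = score
    return max(scores, key=scores.get)

print(naive_bayes_predict(z_new[0], Z_val, y_val))
```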

  19. Results

  20. Results: Accuracy Rates

  21. Results: Evaluation of Accuracy Rates • Similar to highest-performing KDD Cup models • However, predictions drawn from much smaller pool of potential localizations • Also not much better than just predicting nucleus • Still, had fewer input variables with which to work

  22. Level-1 Decision Tree Diagram

  23. Results: Statistical Comparisons • No significant inter-algorithm differences for level-0 models • Hybrid offered some improvement over ANN alone • Equal distribution usually resulted in slightly worse performance • Stacked Generalization resulted in better performance, sometimes significantly so

  24. Conclusions and Future Work

  25. Conclusions and Future Work: Stratifying for Equal Distribution • Not worth it and perhaps harmful • Resulting small sample size may be to blame • Could sample from full training set • Other sampling approaches could be used • Weight variable not necessarily meaningful

  26. Conclusions and Future Work: Specific Models • Algorithms performed comparably to each other • ANN may need more hidden nodes • Hybrid model improved ANN’s performance slightly, but not much • NN may owe some of its performance to the tie-breaker implementation • Naïve Bayesian was not a standout, as might be expected • Could run an Apriori search first

  27. Conclusions and Future Work: Stacked Generalization in General • Somewhat, but not drastically, better performance • Possible ways to improve performance: • Cross-validation could improve both performance and evaluation • Use posterior probabilities instead of actual predictions • Try different algorithms • Continue stacking on more levels (level-2, level-3, etc.) • Apply Stacked Generalization to actual KDD Cup task
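Two of these suggestions, cross-validation when building level-1 data and passing posterior probabilities rather than hard predictions, are directly available in scikit-learn's StackingClassifier. The sketch below, run on the placeholder X and y from the first sketch, illustrates the idea rather than reproducing the original experiment:

```python
from sklearn.ensemble import StackingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import GaussianNB

stack = StackingClassifier(
    estimators=[("tree", DecisionTreeClassifier(random_state=0)),
                ("ann", MLPClassifier(hidden_layer_sizes=(1,), max_iter=2000, random_state=0))],
    final_estimator=GaussianNB(),
    cv=5,                              # level-1 data built via 5-fold cross-validation
    stack_method="predict_proba",      # posterior probabilities as level-1 inputs
)
stack.fit(X, y)
print(stack.predict(x_new))
```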

  28. References • Page, D. (2001). KDD Cup 2001. http://www.cs.wisc.edu/~dpage/kddcup2001/ • Ting, K.M. and Witten, I.H. (1997). Stacked generalization: when does it work? Proceedings of the International Joint Conference on Artificial Intelligence, Japan, 866-871. • Witten, I.H. and Frank, E. (2000). Data Mining. Morgan Kaufmann, San Francisco. • Wolpert, D.H. (1992). Stacked generalization. Neural Networks, 5:241-259.
