Assessing the Performance of Macromolecular Sequence Classifiers

Cornelia Caragea (cornelia@cs.iastate.edu)
Iowa State University
Joint work with Jivko Sinapov, Drena Dobbs, and Vasant Honavar
October 15, 2007
Research supported in part by a grant from the National Institutes of Health (GM066387).

Background and Motivation
• Machine learning methods offer some of the most cost-effective approaches to building predictive models
• One problem, multiple approaches: the same prediction task is often addressed by several competing methods
• Needed: a way to compare the effectiveness of different predictive classifiers
• Difficulty: different data selection and evaluation procedures make reported results hard to compare

Outline
• Macromolecular Sequence Classification
• Performance Evaluation
• Window-Based Cross-Validation
• Sequence-Based Cross-Validation
• Experiments
• Conclusions

Macromolecular Sequence Classification
• Predict a label for each element in a given sequence
• Example: identify post-translational modification residues
[Figure: a protein sequence drawn from the N-terminus (H3N+) to the C-terminus (COO-), with individual residues queried as "Glycosylated?" or "Phosphorylated?"]

Macromolecular Sequence Classification
• Example: identify RNA-binding residues
1T0K_B
SINQKLALVIKSGKYTLGYKSTVKSLRQGKSKLIIIAANTPVLRKSELEYYAMLSKTKVYYFQGGNNELGTAVGKLFRVGVVSILEAGDSDILTTLA
0000000000000000111110010000000000000001100100000000000000000000010000000001111100000000000000000

Macromolecular Sequence Classification
[Diagram with the labels: All Data, Training Data, Test Data, Learning System, Resulting Classifier, Validation, Performance on test set]

Macromolecular Sequence Classification
• Sliding Window Approach:
Sequence: DSNPKYLGVKKFGGEVVKAGNILVRQRGTKFKAGQGVGMGRDHTLFALSDGK
Class:    1111110011111110011111001011111100000001111101000000
Windows, each paired with the class label of its target residue:
. . .
VKKFGGEVVKAGNIL,0
KKFGGEVVKAGNILV,0
KFGGEVVKAGNILVR,1
FGGEVVKAGNILVRQ,1
. . .
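The sliding window approach can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function name `extract_windows`, the window size of 15, and the choice of the central residue as the target are assumptions inferred from the four windows shown on the slide. Residues too close to either end to have a full window are skipped here, which the slide does not specify.

```python
def extract_windows(sequence, labels, window_size=15):
    """Return (window, label) pairs, one per residue that has a full window.

    Assumption: the target residue sits at the center of the window, and
    residues without a full window on both sides are skipped.
    """
    if len(sequence) != len(labels):
        raise ValueError("sequence and labels must have the same length")
    if window_size % 2 == 0:
        raise ValueError("window_size must be odd so that a center exists")
    half = window_size // 2
    windows = []
    for center in range(half, len(sequence) - half):
        window = sequence[center - half:center + half + 1]
        windows.append((window, int(labels[center])))
    return windows


# Sequence and class string taken from the slide.
sequence = "DSNPKYLGVKKFGGEVVKAGNILVRQRGTKFKAGQGVGMGRDHTLFALSDGK"
labels = "1111110011111110011111001011111100000001111101000000"
windows = extract_windows(sequence, labels)

# Print the four windows that the slide lists.
for window, label in windows[8:12]:
    print(f"{window},{label}")
```

Under these assumptions the four printed lines match the windows on the slide, which is why the central residue is taken as the target here.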

Performance Evaluation
• K-Fold Cross-Validation:
[Diagram: the data are partitioned into k subsets S1, S2, ..., Sk-1, Sk; learn classifier C on k-1 subsets, evaluate classifier C on the held-out subset, repeat k times]
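The k-fold loop in the diagram can be sketched generically. The names `k_fold_splits` and `cross_validate`, the fixed random seed, and the toy majority-class "classifier" are all illustrative assumptions; any learner and scoring function could be plugged in.

```python
import random


def k_fold_splits(items, k, seed=0):
    """Yield (train, test) lists; each item is in the test set exactly once."""
    if not 2 <= k <= len(items):
        raise ValueError("k must be between 2 and the number of items")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]  # k disjoint subsets S1..Sk
    for i, test in enumerate(folds):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


def cross_validate(items, k, learn, evaluate, seed=0):
    """Learn on k-1 subsets, evaluate on the held-out one, repeat k times."""
    scores = []
    for train, test in k_fold_splits(items, k, seed):
        classifier = learn(train)
        scores.append(evaluate(classifier, test))
    return sum(scores) / len(scores)


# Toy usage: the "classifier" is just the majority label of the training set.
def learn_majority(train):
    ones = sum(label for _, label in train)
    return 1 if 2 * ones >= len(train) else 0


def accuracy(predicted_label, test):
    return sum(label == predicted_label for _, label in test) / len(test)


data = [(f"x{i}", 1 if i < 8 else 0) for i in range(10)]
mean_accuracy = cross_validate(data, k=5, learn=learn_majority, evaluate=accuracy)
```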

Window-Based Cross-Validation
Procedure:
• Extract windows from all sequences in the dataset
• Partition the set of windows into k disjoint subsets
• Perform standard cross-validation
[Diagram: the windows are partitioned into subsets S1, S2, ..., Sk-1, Sk; learn classifier C, evaluate classifier C, repeat k times]
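The three steps of the window-based procedure can be sketched as below. The helper names, the two toy sequences with their labels, the window size of 5, and the random shuffle before partitioning are assumptions made for illustration; they are not data or code from the study.

```python
import random


def extract_windows(seq_id, sequence, labels, window_size):
    """Return (seq_id, window, label) triples; the center residue is the target."""
    half = window_size // 2
    return [
        (seq_id, sequence[c - half:c + half + 1], int(labels[c]))
        for c in range(half, len(sequence) - half)
    ]


def window_based_folds(dataset, k, window_size=15, seed=0):
    # Step 1: extract windows from ALL sequences into one pool.
    pool = []
    for seq_id, (sequence, labels) in dataset.items():
        pool.extend(extract_windows(seq_id, sequence, labels, window_size))
    # Step 2: partition the pooled windows into k disjoint subsets.
    random.Random(seed).shuffle(pool)
    return [pool[i::k] for i in range(k)]


def cv_rounds(folds):
    # Step 3: standard cross-validation over the folds.
    for i, test in enumerate(folds):
        train = [w for j, fold in enumerate(folds) if j != i for w in fold]
        yield train, test


# Hypothetical toy dataset: sequence id -> (residues, per-residue labels).
dataset = {
    "seqA": ("MKTAYIAKQR", "0011000110"),
    "seqB": ("GSHMLEDPVK", "1000011000"),
}
folds = window_based_folds(dataset, k=3, window_size=5)
rounds = list(cv_rounds(folds))
```

Because the pool is shuffled without regard to `seq_id`, nothing prevents windows from one sequence from landing on both sides of a split.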

Sequence-Based Cross-Validation
Procedure:
• Partition the set of sequences into k disjoint subsets
• Extract windows from sequences in each subset
• Perform standard cross-validation
[Diagram: the sequences are partitioned into subsets S1, S2, ..., Sk-1, Sk; learn classifier C, evaluate classifier C, repeat k times]
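A matching sketch of the sequence-based procedure follows. The only structural difference from pooling windows is the order of operations: sequences are partitioned first and windows are extracted afterwards. As before, the function names, the four toy sequences, and the window size of 5 are hypothetical.

```python
import random


def extract_windows(seq_id, sequence, labels, window_size):
    """Return (seq_id, window, label) triples; the center residue is the target."""
    half = window_size // 2
    return [
        (seq_id, sequence[c - half:c + half + 1], int(labels[c]))
        for c in range(half, len(sequence) - half)
    ]


def sequence_based_folds(dataset, k, window_size=15, seed=0):
    # Step 1: partition the set of SEQUENCES into k disjoint subsets.
    seq_ids = sorted(dataset)
    random.Random(seed).shuffle(seq_ids)
    id_subsets = [seq_ids[i::k] for i in range(k)]
    # Step 2: extract windows from the sequences in each subset.
    folds = []
    for subset in id_subsets:
        fold = []
        for seq_id in subset:
            sequence, labels = dataset[seq_id]
            fold.extend(extract_windows(seq_id, sequence, labels, window_size))
        folds.append(fold)
    # Step 3 (standard cross-validation) then iterates over these folds.
    return folds


# Hypothetical toy dataset: sequence id -> (residues, per-residue labels).
dataset = {
    "seqA": ("MKTAYIAKQR", "0011000110"),
    "seqB": ("GSHMLEDPVK", "1000011000"),
    "seqC": ("AVLKGTPEQW", "0000110001"),
    "seqD": ("RRNDCEHILF", "0110000010"),
}
folds = sequence_based_folds(dataset, k=2, window_size=5)
ids_per_fold = [{seq_id for seq_id, _, _ in fold} for fold in folds]
```

Folds built this way can differ in size when sequences differ in length, a trade-off this sketch accepts.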

Window-Based vs. Sequence-Based Cross-Validation
• Window-Based Cross-Validation:
  • Train and test sets are likely to contain some windows that originate from the same sequence.
  • This violates the independence assumption between train and test sets.
• Sequence-Based Cross-Validation:
  • Windows belonging to the same sequence end up in the same set.
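The difference between the two schemes can be made concrete by measuring how many test windows come from a sequence that also contributes training windows. The sketch below is illustrative only: `leakage` is a hypothetical helper, the toy sequences are invented, and the window-based folds are built by deterministic round-robin assignment rather than a random shuffle so that the result is reproducible.

```python
def windows_of(seq_id, sequence, labels, size=5):
    """Return (seq_id, window, label) triples with the center residue as target."""
    half = size // 2
    return [
        (seq_id, sequence[c - half:c + half + 1], int(labels[c]))
        for c in range(half, len(sequence) - half)
    ]


def leakage(folds):
    """Mean fraction of test windows whose source sequence is also in training."""
    fractions = []
    for i, test in enumerate(folds):
        train_ids = {
            seq_id
            for j, fold in enumerate(folds) if j != i
            for seq_id, _, _ in fold
        }
        shared = sum(seq_id in train_ids for seq_id, _, _ in test)
        fractions.append(shared / len(test))
    return sum(fractions) / len(fractions)


# Hypothetical toy dataset: sequence id -> (residues, per-residue labels).
dataset = {
    "seqA": ("MKTAYIAKQR", "0011000110"),
    "seqB": ("GSHMLEDPVK", "1000011000"),
    "seqC": ("AVLKGTPEQW", "0000110001"),
    "seqD": ("RRNDCEHILF", "0110000010"),
}
all_windows = [
    w for seq_id, (seq, labs) in sorted(dataset.items())
    for w in windows_of(seq_id, seq, labs)
]

# Window-based: windows are dealt out to the folds regardless of origin.
window_folds = [all_windows[i::2] for i in range(2)]
# Sequence-based: all windows of a sequence stay in the same fold.
sequence_folds = [
    [w for w in all_windows if w[0] in ids]
    for ids in ({"seqA", "seqB"}, {"seqC", "seqD"})
]

window_leakage = leakage(window_folds)      # 1.0 on this toy data
sequence_leakage = leakage(sequence_folds)  # 0.0 by construction
```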

Machine Learning Classifiers
• Support Vector Machine:
  • 0/1 String Kernel
  • Example:
    x = VKKFGGEVVKAGNIL
    y = KKFGGEVVKAGNILV
    I[xi=yi] = 010010010000000
• Naïve Bayes:
  • Identity Window: VKKFGGEVVKAGNIL
    x = V,K,K,F,G,G,E,V,V,K,A,G,N,I,L
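Both representations are small enough to sketch directly. The slide shows only the position-wise indicator I[xi=yi]; taking the kernel value to be the number of matching positions is an assumption here, as are the function names.

```python
def match_indicator(x, y):
    """Position-wise indicator I[x_i = y_i] for two equal-length windows."""
    if len(x) != len(y):
        raise ValueError("windows must have the same length")
    return [1 if a == b else 0 for a, b in zip(x, y)]


def zero_one_kernel(x, y):
    """Assumed kernel value: the count of positions where x and y agree."""
    return sum(match_indicator(x, y))


def identity_features(window):
    """Identity-window features for Naive Bayes: one residue per position."""
    return list(window)


x = "VKKFGGEVVKAGNIL"
y = "KKFGGEVVKAGNILV"
indicator = "".join(str(bit) for bit in match_indicator(x, y))
kernel_value = zero_one_kernel(x, y)
features = identity_features(x)
print(indicator)     # 010010010000000, as on the slide
print(kernel_value)  # 3
```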

Datasets
• O-GlycBase dataset:
  • contains experimentally verified glycosylation sites
  • http://www.cbs.dtu.dk/databases/OGLYCBASE/
• RNA-Protein Interface dataset, RB147:
  • consists of RNA-binding protein sequences extracted from structures of known RNA-protein complexes solved by X-ray crystallography in the Protein Data Bank
  • http://bindr.gdcb.iastate.edu/RNABindR/
• Protein-Protein Interface dataset:
  • consists of protein-binding protein sequences

Datasets
[Table: number of positive and negative instances used in our experiments]

Experimental Design
Questions:
• How does Sequence-Based Cross-Validation compare with Window-Based Cross-Validation?
• How do the results vary when we vary the size of the dataset?

Results
[Figure: Receiver Operating Characteristic (ROC) curves for Window-Based and Sequence-Based 10-Fold Cross-Validation using SVM on O-GlycBase]

Results
[Figure: AUC and CC for a) O-GlycBase, b) RNA-Protein Interface, c) Protein-Protein Interface]
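The two measures named in the figure can be computed as sketched below. Assumptions: AUC is the area under the ROC curve, computed here through the equivalent pairwise-ranking formulation, and CC is taken to be the Matthews correlation coefficient, which the slide does not spell out. The function names and toy scores are illustrative and are not results from the study.

```python
import math


def auc(scores, labels):
    """Probability that a random positive is scored above a random negative."""
    positives = [s for s, l in zip(scores, labels) if l == 1]
    negatives = [s for s, l in zip(scores, labels) if l == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


def correlation_coefficient(predicted, actual):
    """Matthews correlation coefficient for binary 0/1 predictions."""
    tp = sum(p == 1 and a == 1 for p, a in zip(predicted, actual))
    tn = sum(p == 0 and a == 0 for p, a in zip(predicted, actual))
    fp = sum(p == 1 and a == 0 for p, a in zip(predicted, actual))
    fn = sum(p == 0 and a == 1 for p, a in zip(predicted, actual))
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: return 0.0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denominator if denominator else 0.0


toy_auc = auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
toy_cc_random = correlation_coefficient([1, 1, 0, 0], [1, 0, 1, 0])
toy_cc_perfect = correlation_coefficient([1, 0, 1, 0], [1, 0, 1, 0])
```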

Conclusions
• We compared two variants of k-fold cross-validation: window-based and sequence-based.
• The comparison shows that Window-Based CV overestimates the performance of the classifiers relative to Sequence-Based CV.
• Sequence-Based CV provides more realistic estimates of performance, because in practice predictors trained on labeled sequence data have to predict the labels for residues in a novel sequence.

Vasant Honavar, Jivko Sinapov, Drena Dobbs
