
Machine Learning Challenges Comp Bio 02-750



  1. Machine Learning Challenges Comp Bio 02-750 Jaime Carbonell, Language Technologies Institute Carnegie Mellon University www.cs.cmu.edu/~jgc 6 September 2012

  2. Today’s topics • Active Learning Beyond Classification • Rank Learning • Active Rank Learning • Coping with Missing Values • Imputation to the mean • More Advanced Imputation • Coping with imbalanced classes • Minority class discovery & classification • Protein-Protein Interactions: Case in point

  3. Active Sampling for RankSVM I • Consider a candidate instance $x$ • Assume $x$ is added to the training set with label $y$ • The total loss on pairs that include $x$ is $\sum_{j=1}^{n} \ell\big(\operatorname{sgn}(y - y_j)\,(f(x) - f(x_j))\big)$, with hinge loss $\ell(t) = \max(0, 1 - t)$ • $n$ is the number of training instances with a different label than $y$ • The objective function to be minimized becomes $\frac{1}{2}\|f\|^2 + C \sum_{(i,k)} \ell\big(f(x_i) - f(x_k)\big) + C \sum_{j=1}^{n} \ell\big(\operatorname{sgn}(y - y_j)\,(f(x) - f(x_j))\big)$, where the first sum runs over the current preference pairs

  4. Active Sampling for RankSVM II • Assume the current ranking function is $f_t$ • For each pair there are two possible cases: the hinge loss is active, $\operatorname{sgn}(y - y_j)\,(f(x) - f(x_j)) < 1$, or it is inactive • Hence the derivative of the pair loss w.r.t. $f$ at a single point is $-\operatorname{sgn}(y - y_j)$ or $0$

  5. Active Sampling for RankSVM III • Substitute the current ranker $f_t$ into the previous equation to estimate the loss derivative for each candidate • The magnitude of the total derivative, $\big\|\sum_{j=1}^{n} \partial \ell_j / \partial f\big\|$, estimates the ability of $x$ to change the current ranker if added into training • Finally, select the candidate whose expected derivative magnitude is largest: $x^* = \arg\max_x \, \mathbb{E}_y\big[\,\big\|\sum_{j=1}^{n} \partial \ell_j / \partial f\big\|\,\big]$
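To ground slides 3-5, here is a minimal Python sketch of the selection rule, assuming a linear ranker $w$, numeric relevance labels, and hinge loss over preference pairs. The helper label_proba (an estimate of P(y | x), e.g. read off the current ranker's scores) and the candidate pool are hypothetical stand-ins, not the lecture's exact construction.

```python
import numpy as np

def hinge_grad_magnitude(w, x, y, X_train, y_train):
    """Magnitude of the loss derivative contributed by candidate (x, y):
    sum of hinge-loss subgradients over all pairs (x, x_j) with y_j != y."""
    grad = np.zeros_like(w)
    for x_j, y_j in zip(X_train, y_train):
        if y_j == y:
            continue
        s = np.sign(y - y_j)              # +1 if x should rank above x_j
        margin = s * (w @ (x - x_j))
        if margin < 1.0:                  # hinge active: subgradient is nonzero
            grad -= s * (x - x_j)
    return np.linalg.norm(grad)

def select_candidate(w, X_pool, X_train, y_train, labels, label_proba):
    """Pick the candidate whose expected derivative magnitude is largest.
    label_proba(x, y) is an assumed estimator of P(y | x)."""
    def expected_change(x):
        return sum(label_proba(x, y) * hinge_grad_magnitude(w, x, y, X_train, y_train)
                   for y in labels)
    return max(X_pool, key=expected_change)
```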

  6. Active Sampling for RankBoost I • Again, estimate how the current ranker $H$ would change if $x$ were in the training set • Estimate this change by the difference in ranking loss before and after $x$ is added • The ranking loss w.r.t. $H$ is (Freund et al., 2003) $\mathcal{L}(H) = \sum_{(x_i, x_j)} D(x_i, x_j)\,[\![\, H(x_j) \geq H(x_i) \,]\!]$, the weighted fraction of preference pairs ($x_i$ preferred to $x_j$) that $H$ mis-orders

  7. Active Sampling for RankBoost II • Difference in the ranking loss between the current training set $S$ and the enlarged set $S \cup \{x\}$: $\Delta\mathcal{L}(x, y) = \mathcal{L}_{S \cup \{x\}}(H) - \mathcal{L}_S(H)$ • $\Delta\mathcal{L}$ indicates how much the current ranker needs to change to compensate for the loss introduced by the new instance • Finally, the instance with the highest expected loss differential is sampled: $x^* = \arg\max_x \, \mathbb{E}_y\big[\Delta\mathcal{L}(x, y)\big]$
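A sketch of the loss-differential criterion from slides 6-7, assuming uniform pair weights in place of Freund et al.'s distribution D(x_i, x_j); H is any scoring function, and label_proba is again a hypothetical estimate of P(y | x).

```python
import numpy as np

def ranking_loss(H, X, y):
    """Fraction of preference pairs (i, j) with y[i] > y[j] that the
    current ranker H mis-orders (uniform pair weights)."""
    scores = np.array([H(x) for x in X])
    pairs = [(i, j) for i in range(len(X)) for j in range(len(X)) if y[i] > y[j]]
    if not pairs:
        return 0.0
    mistakes = sum(scores[i] <= scores[j] for i, j in pairs)
    return mistakes / len(pairs)

def loss_differential(H, x, y_assumed, X, y):
    """Ranking loss added by candidate x under an assumed label."""
    return ranking_loss(H, list(X) + [x], list(y) + [y_assumed]) - ranking_loss(H, X, y)

def select_candidate(H, X_pool, X, y, labels, label_proba):
    """Sample the instance with the highest expected loss differential."""
    def expected_diff(x):
        return sum(label_proba(x, lab) * loss_differential(H, x, lab, X, y)
                   for lab in labels)
    return max(X_pool, key=expected_diff)
```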

  8. Performance Measures • MAP (Mean Average Precision): for one query, $AP = \frac{1}{|R|} \sum_{k: d_k \in R} P@k$, the mean of the precision values at the ranks of the relevant documents $R$; MAP is the average of AP values for all queries • NDCG (Normalized Discounted Cumulative Gain): $DCG@k = \sum_{i=1}^{k} \frac{2^{rel_i} - 1}{\log_2(i + 1)}$, divided by the ideal DCG so that a perfect ranking scores 1 • The impact of each relevant document is discounted as a function of rank position
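Both measures are easy to compute directly; the following uses the common 2^rel - 1 gain for NDCG, one of several standard variants.

```python
import numpy as np

def average_precision(relevances):
    """AP for one ranked list; relevances are binary, in rank order."""
    rel = np.asarray(relevances, dtype=float)
    if rel.sum() == 0:
        return 0.0
    precision_at_k = np.cumsum(rel) / (np.arange(len(rel)) + 1)
    return float((precision_at_k * rel).sum() / rel.sum())

def mean_average_precision(ranked_lists):
    """MAP: mean of AP over all queries."""
    return float(np.mean([average_precision(r) for r in ranked_lists]))

def ndcg(relevances, k=None):
    """NDCG@k with 2^rel - 1 gain and a log2 position discount."""
    rel = np.asarray(relevances, dtype=float)[:k]
    dcg = ((2 ** rel - 1) / np.log2(np.arange(len(rel)) + 2)).sum()
    ideal = np.sort(np.asarray(relevances, dtype=float))[::-1][:k]
    idcg = ((2 ** ideal - 1) / np.log2(np.arange(len(ideal)) + 2)).sum()
    return float(dcg / idcg) if idcg > 0 else 0.0
```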

  9. Results on TREC03 (figure)

  10. What is Missing? • In active learning the category label is missing, and we can query an oracle, mindful of cost • What else can be missing? • Features: we may not have enough for prediction • Feature combinations: beyond those the classifier is able to generate automatically (e.g. XOR, ratios) • Values of features: not all instances have values for all their features • Feature relevance: some features are noisy or irrelevant • Feature redundancy: e.g. high feature covariance

  11. Reducing the Feature Space • Feature selection • Subsample features using IG, MI, … • Well studied, e.g. Yang & Pedersen ICML 1997 • Wrapper methods • Inefficient but accurate, less studied • Feature projection (to lower dimensions) • LDA, SVD, LSI • Slow, well studied, e.g. Falluchi et al 2009 • Kernel functions on feature sub-spaces
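As a quick illustration (not from the slides), scikit-learn covers both routes: SelectKBest with mutual information for feature selection, and TruncatedSVD, the decomposition behind LSI, for projection.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = make_classification(n_samples=500, n_features=100, random_state=0)

# Feature selection: keep the k features with the highest mutual
# information with the label (the MI criterion from the slide).
X_sel = SelectKBest(mutual_info_classif, k=20).fit_transform(X, y)

# Feature projection: map to a lower-dimensional subspace via
# truncated SVD, the same decomposition that underlies LSI.
X_proj = TruncatedSVD(n_components=20, random_state=0).fit_transform(X)
```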

  12. Missing Feature Values • Active learning of features • Not as extensively studied as active instance learning (see Saar-Tsechansky et al., 2007) • Determines which feature values to seek for given instances, or which features across the board • Can be combined with active instance learning • But, what if there is no oracle? • Impossible to get feature values • Too costly or too time consuming • Do we ignore instances with missing features?

  13. Missing Data

  14. How to Cope with Missing Features • ML training assumes feature completeness • Filter out features that are mostly missing • Filter out instances with missing features • Impute values for missing features • Radically change the ML algorithms • When do we do each of the above? • With lots of data and few missing features… • With sparse training data and few missing… • With sparse data and mostly missing features…
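A small pandas sketch of the first three strategies; the 0.5 missingness threshold is an arbitrary choice for illustration, not a recommendation from the lecture.

```python
import pandas as pd

def cope_with_missing(df: pd.DataFrame, max_missing_frac: float = 0.5):
    """Illustrates three of the slide's options on one DataFrame."""
    # 1. Filter out features that are mostly missing.
    kept = df.loc[:, df.isna().mean() <= max_missing_frac]
    # 2. Filter out instances with missing features
    #    (sensible with lots of data and few gaps).
    complete_rows = kept.dropna()
    # 3. Impute values for missing features (sensible with sparse data):
    #    column means here; better options appear on the next slides.
    imputed = kept.fillna(kept.mean(numeric_only=True))
    return complete_rows, imputed
```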

  15. Missing Feature Imputation • How do we estimate missing feature values? • Infer the mean value across all instances • Infer the mean value in a neighborhood • Apply a classifier with the other features as input and the missing feature value as y (the label) • How do we know if it makes a difference? • Sensitivity analysis (extrema, perturbations) • Compare training without the instances that have missing features vs. training with imputed values for the missing features
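The three imputation options map naturally onto scikit-learn's imputers; a sketch with toy data rather than the lecture's code. IterativeImputer implements the "model on the other features, missing value as y" idea (with regressors by default).

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, KNNImputer, IterativeImputer

X = np.array([[1.0, 2.0], [3.0, np.nan], [np.nan, 6.0], [5.0, 8.0]])

# Mean value across all instances.
X_mean = SimpleImputer(strategy="mean").fit_transform(X)

# Mean value in a neighborhood (k nearest neighbors).
X_knn = KNNImputer(n_neighbors=2).fit_transform(X)

# Model-based: predict each missing feature from the other features.
X_model = IterativeImputer(random_state=0).fit_transform(X)
```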

  16. More on Missing Values • Missing Completely at Random (MCAR) • Missing at Random (MAR) • Statisticians assume MAR as the default; it is generally impossible to prove MCAR or MAR • Missing values that depend on observables • Imputation via classification/regression • Missing values that depend on unobservables • e.g., missingness that depends on the value itself

  17. Imputation – Example [from Fan 2008] • How to impute the missing SCL for patient #5? • Sample mean: (3.8 + 0.6 + 1.1 + 1.3)/4 = 1.7 • By age: (3.8 + 0.6)/2 = 2.2 • By sex: 1.1 • By education: 1.3 • By race: (3.8 + 0.6 + 1.3)/3 = 1.9 • By ADL: (1.1 + 1.3)/2 = 1.2 • Who is/are in the same “slice” as #5?
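The same slice arithmetic, recreated in pandas on a hypothetical table: the patient attributes below are invented so that each slice reproduces the quoted means; only the SCL values and the resulting averages come from the slide.

```python
import pandas as pd

# Hypothetical reconstruction of the slide's table (attribute values
# are made up so each "slice" matches the quoted means).
df = pd.DataFrame({
    "age":  [70, 70, 60, 60, 70],
    "sex":  ["M", "M", "F", "M", "F"],
    "edu":  ["HS", "HS", "HS", "Coll", "Coll"],
    "race": ["W", "W", "B", "W", "W"],
    "adl":  [0, 0, 1, 1, 1],
    "scl":  [3.8, 0.6, 1.1, 1.3, None],
}, index=[1, 2, 3, 4, 5])

known, target = df.dropna(subset=["scl"]), df.loc[5]
print(round(known["scl"].mean(), 1))                  # sample mean: 1.7
for col in ["age", "sex", "edu", "race", "adl"]:
    same_slice = known[known[col] == target[col]]
    print(col, round(same_slice["scl"].mean(), 1))    # 2.2, 1.1, 1.3, 1.9, 1.2
```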

  18. Further Reading • Saar-Tsechansky & Provost: http://www.springerlink.com/content/k5m57475n1658723/fulltext.pdf • Yang, Y. and Pedersen, J.O. “A Comparative Study on Feature Selection in Text Categorization.” ICML 1997, pp. 412-420 • Gelman chapter: http://www.stat.columbia.edu/~gelman/arm/missing.pdf • Applications in biomed: Lavori, P., R. Dawson and D. Shera (1995) “A Multiple Imputation Strategy for Clinical Trials with Truncation of Patient Data.” Statistics in Medicine 14: 1913-1925.

  19. Unbalanced Classes in ML (pipeline diagram: Raw Data (relational, temporal) → Feature Extraction → Feature Representation → Unbalanced Unlabeled Data Set → Rare Category Detection → Learning in Unbalanced Settings → Classifier)

  20. Minority Class Discovery Method (flowchart) 1. Calculate a problem-specific similarity 2. Score each unlabeled instance 3. Select the top-scoring instance and query its label 4. If it reveals a new class, output it 5. Otherwise apply relevance feedback and increase t by 1 6. Repeat until the budget is exhausted

  21. Scoring Function • The estimated density: $\hat{p}(x) = \frac{1}{n} \sum_{i=1}^{n} K(x, x_i)$, where $K$ is a kernel over the instances • Scoring function: the norm of the density gradient, $s(x) = \|\nabla_x \hat{p}(x)\|$, which is largest where the local density changes sharply, i.e. near potential minority-class boundaries
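A sketch of the gradient-norm score under a Gaussian kernel density estimate; the kernel, bandwidth, and normalization are illustrative assumptions, not necessarily the exact estimator behind MALICE.

```python
import numpy as np

def density_gradient_score(x, X, bandwidth=1.0):
    """Norm of the gradient of a Gaussian KDE at x. High scores flag
    sharp local density changes, where a rare class may border the
    majority class."""
    diffs = X - x                                   # rows are x_i - x
    weights = np.exp(-np.sum(diffs ** 2, axis=1) / (2 * bandwidth ** 2))
    # Gradient of (1/n) * sum_i exp(-||x - x_i||^2 / (2 h^2)) w.r.t. x.
    grad = (weights[:, None] * diffs).sum(axis=0) / (len(X) * bandwidth ** 2)
    return float(np.linalg.norm(grad))

# Query the unlabeled point with the largest gradient norm, e.g.:
# x_star = max(X_unlabeled, key=lambda x: density_gradient_score(x, X_unlabeled))
```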

  22. Summary of Real Data Sets • Abalone: 4177 examples, 7-dimensional features, 20 classes, largest class 16.50%, smallest class 0.34% • Shuttle: 4515 examples, 9-dimensional features, 7 classes, largest class 75.53%, smallest class 0.13%

  23. Results on Real Data Sets (figures: class-discovery curves on Abalone and Shuttle for MALICE, Interleave, and random sampling)

  24. Computational Virology via PPIs • Degree distribution / hub analysis / disease checking • Graph-module analysis (from a bi-clustering study) • Protein-family based graph patterns (receptors / receptor subclasses / ligands, etc.)

  25. HIV-1 host protein interactions • HIV-1 depends on the cellular machinery in every aspect of its life cycle (figure: fusion, reverse transcription, transcription, maturation, budding; Peterlin and Trono, Nature Rev. Immunol. 2003)

  26. PPIs: Protein-Protein Interactions • The cell machinery is run by proteins • Enzymatic activities, replication, translation, transport, signaling, structural roles • Proteins interact with each other to perform these functions • Through physical contact • Indirectly in a protein complex • Indirectly in a pathway • http://www.cellsignal.com/reference/pathway/Apoptosis_Overview.html

  27. Interactions reported in NIAID • Example: “Nef binds hemopoietic cell kinase isoform p61HCK” (HIV-1 protein → human protein) • Group 1 (more likely direct) — keywords: binds, cleaves, interacts with, methylated by, myristoylated by, etc.: 1063 interactions, 721 human proteins, 17 HIV-1 proteins • Group 2 (could be indirect) — keywords: activates, associates with, causes accumulation of, etc.: 1454 interactions, 914 human proteins, 16 HIV-1 proteins • http://www.ncbi.nlm.nih.gov/RefSeq/HIVInteractions/

  28. Sources of Labels • Literature • Lab experiments • Human experts (diagram: feature importance; active selection of instances and reliable labelers)

  29. Estimating expert labeling accuracies • Estimate each expert’s accuracy jointly with the unknown true labels • Solve this through expectation maximization (EM) • Assumes experts are conditionally independent given the true label
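A minimal EM sketch in this spirit for binary labels, assuming each expert is summarized by a single accuracy parameter (a simplification of Dawid-Skene-style models) and that experts are conditionally independent given the true label, as the slide states.

```python
import numpy as np

def em_expert_accuracies(votes, n_iter=50):
    """votes: (n_items, n_experts) matrix of 0/1 labels.
    Returns per-expert accuracy estimates and P(true label = 1) per item."""
    n_items, n_experts = votes.shape
    acc = np.full(n_experts, 0.7)          # initial accuracy guess
    prior = 0.5
    for _ in range(n_iter):
        # E-step: posterior of the true label, with experts
        # conditionally independent given that label.
        log_p1 = np.log(prior) + (votes * np.log(acc) +
                                  (1 - votes) * np.log(1 - acc)).sum(axis=1)
        log_p0 = np.log(1 - prior) + (votes * np.log(1 - acc) +
                                      (1 - votes) * np.log(acc)).sum(axis=1)
        p1 = 1.0 / (1.0 + np.exp(log_p0 - log_p1))
        # M-step: accuracy = expected fraction of votes agreeing with truth.
        agree = p1[:, None] * votes + (1 - p1)[:, None] * (1 - votes)
        acc = np.clip(agree.mean(axis=0), 1e-3, 1 - 1e-3)
        prior = float(np.clip(p1.mean(), 1e-3, 1 - 1e-3))
    return acc, p1
```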

  30. Refined interactome • Solid line: probability of being a direct interaction is ≥ 0.5 • Dashed line: probability of being a direct interaction is < 0.5 • Edge thickness indicates confidence in the interaction

  31. THANK YOU!
