This perspective explores the role of data in machine learning, focusing on supervised vs. unsupervised learning and the importance of representative training sets for predictive tasks. It delves into the challenges of subtyping cancer patients, selecting relevant features, and measuring improvement against competitors. The discussion includes strategies to address missing data and negative control features, emphasizing robustness and accuracy in model development.
A Perspective on the Data • Ajit Paul Singh • M.Sc. Candidate, Dept. of Computing Science, University of Alberta
Machine Learning • Systems that use experience to improve at a given task • Data as experience • Supervised vs. Unsupervised Learning • SNP focus: supervised learning
Data Assumptions • Samples are independent and identically distributed (IID) • Dealing with patients/tuples • One set: a complex distribution, but more training data • Split into subsets: many simpler distributions, but less training data per problem
Defining the Task • Predictive • Diagnosing members of the public • Rare class issue • Diagnosing clinic referrals • Is the training set representative of patients that will be tested? • Subtyping cancer patients • Feature Selection • Find interesting SNPs for further study
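The rare class issue can be made concrete with Bayes' rule. The sketch below is illustrative only: the prevalence, sensitivity, and specificity values are hypothetical numbers, not figures from the talk, and `positive_predictive_value` is a helper name introduced here. It compares how trustworthy a positive prediction would be for members of the public (rare class) versus clinic referrals (more common class).

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """Return P(disease | positive prediction) using Bayes' rule."""
    true_pos = prevalence * sensitivity
    false_pos = (1.0 - prevalence) * (1.0 - specificity)
    return true_pos / (true_pos + false_pos)

# Hypothetical numbers, chosen only to illustrate the effect of prevalence.
general_public = positive_predictive_value(0.01, 0.90, 0.90)
clinic_referrals = positive_predictive_value(0.30, 0.90, 0.90)

print(f"general public:   {general_public:.3f}")
print(f"clinic referrals: {clinic_referrals:.3f}")
```

With the same classifier, a positive call is far less reliable in the low-prevalence population, which is one reason a training set drawn from referrals may not represent the public.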
Measuring Improvement • Competitors • Human experts using clinical data • Diagnostic tests (e.g. BRCA1 truncations) • Other learners using genetic markers • Benefits of Polyomx • Accuracy, Cost, Speed • Need for a baseline to compare against
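One simple baseline, offered here as an assumption rather than the one the talk had in mind, is a learner that always predicts the most common training label. The sketch below uses made-up labels and a hypothetical helper, `majority_baseline_accuracy`, to show the accuracy any real learner would need to beat.

```python
from collections import Counter

def majority_baseline_accuracy(train_labels, test_labels):
    """Predict the most common training label for every test sample."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for y in test_labels if y == majority_label)
    return majority_label, hits / len(test_labels)

# Made-up labels for illustration only.
train = ["control"] * 8 + ["case"] * 2
test = ["control"] * 7 + ["case"] * 3

label, accuracy = majority_baseline_accuracy(train, test)
print(label, accuracy)
```

A learner that reports 70% accuracy on this test set has learned nothing beyond the class frequencies, so comparisons against human experts or diagnostic tests should be read alongside a number like this.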
Issues to Consider • Missing data • Negative control features
Types of Missing Data • Missing Completely At Random (MCAR) • Missing At Random (MAR) • Censored
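The difference between the first two types can be sketched with a small simulation. Under MCAR the chance that a value is missing does not depend on the data at all; under MAR it depends only on values that are observed. Everything below is synthetic and assumed for illustration: `x` stands for an always-observed covariate, `y` for a measurement that may go missing, and the missingness probabilities are arbitrary. Censoring is not simulated here.

```python
import random

def simulate(n=20000, seed=0):
    """Generate (x, y, mcar_missing, mar_missing) rows of synthetic data."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)        # always-observed covariate
        y = x + rng.gauss(0.0, 1.0)    # value that may go missing
        mcar_missing = rng.random() < 0.5                      # ignores the data
        mar_missing = rng.random() < (0.9 if x > 0 else 0.1)   # depends on observed x
        rows.append((x, y, mcar_missing, mar_missing))
    return rows

def mean(values):
    values = list(values)
    return sum(values) / len(values)

rows = simulate()
full_mean = mean(y for _, y, _, _ in rows)
mcar_mean = mean(y for _, y, m, _ in rows if not m)  # complete cases, MCAR
mar_mean = mean(y for _, y, _, m in rows if not m)   # complete cases, MAR

print(f"full: {full_mean:.3f}  MCAR: {mcar_mean:.3f}  MAR: {mar_mean:.3f}")
```

In this setup, dropping incomplete rows leaves the MCAR estimate close to the full-data mean, while the MAR estimate is pulled downward because rows with large `x` (and hence large `y`) are missing more often.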
Negative Control Features • SNPs were hand selected • Feature selection problem • Measuring relevance of selected features • Prediction problem • Ensuring the learner is robust • Add negative control features • Features that are probably irrelevant
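One way to use negative control features, sketched here under stated assumptions, is to add randomly generated genotype columns that cannot be relevant and use their scores as a noise floor for the real features. The relevance score (absolute Pearson correlation), the synthetic data, and the names `abs_correlation` and `control_threshold` are all choices made for this illustration, not the method used in the talk.

```python
import random

def abs_correlation(xs, ys):
    """Absolute Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a constant column carries no information
    return abs(cov) / (vx * vy) ** 0.5

def control_threshold(labels, n_samples, n_controls, rng):
    """Highest score achieved by purely random (negative control) features."""
    scores = []
    for _ in range(n_controls):
        control = [rng.choice((0, 1, 2)) for _ in range(n_samples)]
        scores.append(abs_correlation(control, labels))
    return max(scores)

rng = random.Random(0)
n = 500
labels = [rng.randint(0, 1) for _ in range(n)]
# Synthetic "relevant" feature: agrees with the label about 90% of the time.
snp_signal = [y if rng.random() < 0.9 else 1 - y for y in labels]
# Synthetic "irrelevant" feature: random genotype codes.
snp_noise = [rng.choice((0, 1, 2)) for _ in range(n)]

threshold = control_threshold(labels, n, n_controls=20, rng=rng)
signal_score = abs_correlation(snp_signal, labels)
noise_score = abs_correlation(snp_noise, labels)

print(f"threshold: {threshold:.3f}")
print(f"snp_signal: {signal_score:.3f}  snp_noise: {noise_score:.3f}")
```

A hand-selected SNP that scores no higher than the controls is probably not contributing, and a learner whose accuracy changes noticeably when controls are added is probably not robust.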