Perspective on Data in Machine Learning: Enhancing Predictive Models and Feature Selection for Improved Outcomes

A Perspective on the Data • Ajit Paul Singh, M.Sc. Candidate • Dept. of Computing Science, University of Alberta

Machine Learning • Systems that use experience to improve at a given task • Data as experience • Supervised vs. unsupervised learning • SNP focus: supervised learning

The Running Example

Data Assumptions • Samples are independent and identically distributed (IID) • Dealing with patients/tuples • One set → complex distribution → more training data • Split into subsets → many simpler distributions → less training data per problem
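The trade-off in the last two bullets can be made concrete with a small sketch. It assumes patients are stored as tuples (here, dictionaries) and that a hypothetical `subtype` field is used to split them; the field names and values are invented for illustration and are not from the study data.

```python
from collections import defaultdict

def split_by_key(samples, key):
    """Group samples into subsets according to the value stored under `key`."""
    subsets = defaultdict(list)
    for sample in samples:
        subsets[sample[key]].append(sample)
    return dict(subsets)

# Hypothetical patient tuples: a subtype label plus two SNP genotypes (0/1/2).
patients = [
    {"subtype": "A", "snp1": 0, "snp2": 1},
    {"subtype": "A", "snp1": 1, "snp2": 1},
    {"subtype": "B", "snp1": 2, "snp2": 0},
    {"subtype": "B", "snp1": 0, "snp2": 0},
    {"subtype": "C", "snp1": 1, "snp2": 2},
    {"subtype": "C", "snp1": 2, "snp2": 2},
]

subsets = split_by_key(patients, "subtype")
sizes = {name: len(rows) for name, rows in subsets.items()}

# One learner on the whole set sees every patient; a learner per subset
# sees only that subset's share of the training data.
print("single set:", len(patients))
print("per subset:", sizes)
```

Each subset may follow a simpler distribution, but every per-subset learner has to be trained on a fraction of the original samples.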

Defining the Task • Predictive • Diagnosing members of the public • Rare class issue • Diagnosing clinic referrals • Is the training set representative of the patients who will be tested? • Subtyping cancer patients • Feature Selection • Find interesting SNPs for further study
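A minimal sketch of why the rare class issue matters when diagnosing members of the public. The prevalence (10 cases in 1000 people) is an invented figure for illustration only; the point is that a predictor which never diagnoses anyone can still score high accuracy.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def sensitivity(y_true, y_pred, positive=1):
    """Fraction of truly positive samples that were predicted positive."""
    predictions_for_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    hits = sum(p == positive for p in predictions_for_positives)
    return hits / len(predictions_for_positives)

# Hypothetical screening population: 10 cases (1) among 1000 people.
y_true = [1] * 10 + [0] * 990

# A trivial "learner" that always predicts the majority class (healthy).
y_pred = [0] * len(y_true)

majority_accuracy = accuracy(y_true, y_pred)
majority_sensitivity = sensitivity(y_true, y_pred)

print("accuracy:", majority_accuracy)
print("sensitivity:", majority_sensitivity)
```

Under these assumptions the trivial predictor is right for every healthy person and wrong for every case, so accuracy alone says little about diagnostic value.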

Measuring Improvement • Competitors • Human experts using clinical data • Diagnostic tests (e.g. BRCA1 truncations) • Other learners using genetic markers • Benefits of Polyomx • Accuracy, Cost, Speed • Need for a baseline to compare against

Issues to Consider • Missing data • Negative control features

Types of Missing Data • Missing Completely At Random (MCAR) • Missing At Random (MAR) • Censored
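The three types can be illustrated on synthetic data. This is a sketch under stated assumptions: the `clinic` variable, the missingness rates, and the censoring limit are all hypothetical. MCAR drops values with one fixed probability, MAR drops them with a probability that depends only on a fully observed variable, and censoring records only that a value exceeded a limit.

```python
import random

def make_mcar(values, rate, rng):
    """MCAR: every value is dropped with the same probability."""
    return [None if rng.random() < rate else v for v in values]

def make_mar(values, observed, rate_for, rng):
    """MAR: the drop probability depends only on a fully observed variable."""
    return [None if rng.random() < rate_for(o) else v
            for v, o in zip(values, observed)]

def censor(times, limit):
    """Censoring: values above the limit are stored as (limit, True)."""
    return [(min(t, limit), t > limit) for t in times]

def missing_rate(column, groups, group):
    """Fraction of missing entries among samples belonging to one group."""
    picked = [v for v, g in zip(column, groups) if g == group]
    return sum(v is None for v in picked) / len(picked)

rng = random.Random(0)
n = 10_000
clinic = [rng.choice(["X", "Y"]) for _ in range(n)]   # fully observed (hypothetical)
genotype = [rng.choice([0, 1, 2]) for _ in range(n)]  # SNP genotype, complete

mcar = make_mcar(genotype, 0.2, rng)
# Hypothetical mechanism: only clinic Y sometimes fails to genotype a patient.
mar = make_mar(genotype, clinic, lambda c: 0.4 if c == "Y" else 0.0, rng)

for name, column in [("MCAR", mcar), ("MAR", mar)]:
    print(name, {c: round(missing_rate(column, clinic, c), 3) for c in ("X", "Y")})

# Follow-up times censored at a hypothetical 10-year study limit.
print(censor([3, 12], 10))
```

In the MCAR column the missing rate is roughly the same in both clinics, while in the MAR column it differs by clinic, which is the pattern that distinguishes the two mechanisms in practice.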

Negative Control Features • SNPs were hand selected • Feature selection problem • Measuring relevance of selected features • Prediction problem • Ensuring the learner is robust • Add negative control features • Features that are probably irrelevant
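One way to add negative control features is sketched below, assuming genotypes are coded 0/1/2 and that relevance is scored by the absolute difference in mean genotype between cases and non-cases. Both the scoring rule and the synthetic data are assumptions made for illustration, not the method used in the study. Random columns are generated independently of the label, so any selected feature that scores no better than they do is suspect.

```python
import random

def add_negative_controls(rows, n_controls, rng):
    """Append n_controls random genotype columns (0/1/2) unrelated to the label."""
    return [row + [rng.choice([0, 1, 2]) for _ in range(n_controls)]
            for row in rows]

def mean_difference_scores(rows, labels):
    """Score each column by |mean among cases - mean among non-cases|."""
    scores = []
    for j in range(len(rows[0])):
        cases = [r[j] for r, y in zip(rows, labels) if y == 1]
        others = [r[j] for r, y in zip(rows, labels) if y == 0]
        scores.append(abs(sum(cases) / len(cases) - sum(others) / len(others)))
    return scores

rng = random.Random(0)
labels = [rng.choice([0, 1]) for _ in range(2000)]

# One synthetic "hand-selected" SNP whose genotype is shifted upward in cases.
rows = [[rng.choice([1, 2]) if y == 1 else rng.choice([0, 1])] for y in labels]

augmented = add_negative_controls(rows, 5, rng)
scores = mean_difference_scores(augmented, labels)

selected_score = scores[0]
control_ceiling = max(scores[1:])  # best score reached by an irrelevant feature

print("selected SNP score:", round(selected_score, 3))
print("best negative control score:", round(control_ceiling, 3))
```

The best negative control score gives a rough floor: a feature selection method or learner that ranks the random columns alongside the hand-selected SNPs is probably fitting noise.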
