Data Mining and Machine Learning in Population Health Studies

Marina Sokolova

Dept of ECM and School of EECS,

University of Ottawa

Institute for Big Data Analytics

Data Mining
  • Science and technology that discover new knowledge in large data sets
    • Vast amount of accumulated data
      • XXX,XXX,XXX records from health insurance companies in New York State alone

=> automated methods

    • Ever-changing data
      • New drugs and tests change the problem

=> adaptive methods

    • Beyond human processing capacities

Structured Data

Databases, mostly organizational

Unstructured Data
  • Text
    • He had an uncomplicated postoperative course and he was transferred . Advanced his diet on postop day # 4 to a transitional diet ...
    • Experts fear that Ebola will mutate and become spreadable via cough or sneeze ...
  • Images

Privacy Protection
  • Individuals cannot be uniquely identified from the data set
  • Mandatory for health data custodians and human subject studies
  • HIPAA, PHIPA, etc.
  • Privacy-preserving methods
    • De-identification, i.e., severing a data set from the identity of the data contributor; the data may still include identifying information that a trusted party could re-link in certain situations
    • Anonymization, i.e., irreversibly severing a data set from the identity of the data contributor
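
The difference between the two methods can be sketched in a few lines of Python. This is only an illustration, not a compliant procedure: the field names, the secret key and the `link_table` are hypothetical, and real data sets also need attention to indirect identifiers (e.g., age plus postal code), which this sketch ignores.

```python
import hashlib
import hmac

# Hypothetical direct identifiers; a real data set needs a reviewed list.
IDENTIFIERS = ("name", "health_card")

def anonymize(record):
    """Drop direct identifiers with no way to restore them."""
    return {k: v for k, v in record.items() if k not in IDENTIFIERS}

def de_identify(record, secret_key, link_table):
    """Replace direct identifiers with a keyed pseudonym.

    A trusted party keeps `secret_key` and `link_table`, so the record
    can be re-linked later; other users see only the pseudonym.
    """
    identity = "|".join(str(record[k]) for k in IDENTIFIERS)
    digest = hmac.new(secret_key, identity.encode("utf-8"), hashlib.sha256)
    pseudonym = digest.hexdigest()[:16]
    link_table[pseudonym] = {k: record[k] for k in IDENTIFIERS}
    released = anonymize(record)
    released["pseudonym"] = pseudonym
    return released

# Made-up record for illustration.
record = {"name": "Jane Doe", "health_card": "0000-000-000", "diagnosis": "asthma"}
link_table = {}
released = de_identify(record, b"hypothetical-secret", link_table)
```

Here `released` carries the diagnosis and a pseudonym, while `link_table` stays with the trusted party; `anonymize(record)` keeps no link at all.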

Data Mining Process

Step 1: Data pre-processing

  • Sample selection
  • Noise reduction
  • Unstructured to structured transformation
  • Privacy protection

Step 2: Information processing

  • Record classification
  • Clustering
  • Association rule mining

Step 3: Evaluation

  • Performance assessment
  • Result interpretation

Machine Learning
  • Ability of algorithms to discover properties in previously unseen data, based on known properties found in training data
  • Algorithmic “muscles” of Data Mining
  • Common tasks:
    • Classification of instances
    • Clustering of instances

More on ML tasks
  • Classification/supervised learning
    • An algorithm assigns data items to pre-defined categories (e.g., No, <30, >30)
    • Categories do not overlap
      • Binary classification is the most common
    • There could be more than one category for an item (multi-labelled classification): C + Female + [10-20)
  • Clustering/unsupervised learning
    • Grouping data items according to their similarities
    • Clusters usually do not overlap

Essential Parts of ML
  • Learning modes
  • Training and test stages
  • Model selection (validation and testing, cross-validation, leave-one-out)
  • Algorithms (e.g., K-NN, Naïve Bayes, Support Vector Machines)
  • Performance evaluation

Learning Modes
  • Classification/Supervised
    • Data items are labelled
      • One page of a professionally annotated text from a medical domain - $10,000
      • 600 personal health records - $1,500 for de-identification and 1-2 months for an experienced Research Assistant to extract relevant information ($4,500 + overhead). Note that we usually need thousands of records!
    • The most accurate results
  • Clustering/Unsupervised
    • Data items are not labelled
    • Plenty of such data
    • Hard to evaluate, usually approximate results
  • Semi-supervised
    • A mixture of labelled and unlabelled data

Training and Test Stages
  • Training and test data
    • Data sets are split into non-overlapping parts
    • Training sets are usually bigger than test sets
  • An algorithm is applied to the training set
    • Its results are verified either automatically (supervised learning) or manually (unsupervised learning)
    • The algorithm parameters are adjusted depending on the results
  • The model with the best results is applied to the test set
  • Errors are counted on the test set only!
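
A minimal sketch of the split described above, in plain Python. The 25% test fraction and the fixed seed are assumptions chosen for illustration; the slides only say that the training part is usually the bigger one.

```python
import random

def train_test_split(items, test_fraction=0.25, seed=0):
    """Shuffle a copy of the items and cut it into two non-overlapping parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    # The first n_test shuffled items form the test set, the rest the training set.
    return shuffled[n_test:], shuffled[:n_test]

records = list(range(100))  # stand-ins for 100 labelled records
train, test = train_test_split(records)
```

All parameter tuning would then use `train` only, with `test` touched once at the end to count errors.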

The Model Selection
  • Validation and test
    • Divide the initial set into 3 parts (training, validation, test)
    • Use 1 part for training and 1 part for validation
    • Apply on the test part
  • Cross-validation
    • Divide the initial set into 5 (or 10) parts
    • Use 4 (or 9) parts for training and 1 part for testing
    • Repeat 5 (or 10) times, each time with a different test part
  • Leave-one-out
    • Use all items but one for training
    • Apply the algorithm on the remaining item
    • Repeat for all data items
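
Cross-validation and leave-one-out differ only in how many parts the data is cut into, which a short sketch makes visible. The function names are made up for this example, and the sketch assumes the items have already been shuffled.

```python
def k_fold_splits(n_items, k):
    """Yield (train_indices, test_indices) once for each of the k parts."""
    indices = list(range(n_items))
    # Spread any remainder over the first parts so sizes differ by at most one.
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

def leave_one_out_splits(n_items):
    """Leave-one-out is k-fold with as many parts as there are items."""
    return k_fold_splits(n_items, n_items)

folds = list(k_fold_splits(10, 5))     # 5 rounds: 8 items train, 2 items test
loo = list(leave_one_out_splits(4))    # 4 rounds: 3 items train, 1 item test
```

Each item lands in a test part exactly once, so the errors from all rounds can be pooled into one estimate.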

Algorithms
  • Probability-based (Naïve Bayes)
  • Prototype-based (K-NN)
  • Optimization-based (SVM)
  • Decision-based (Decision Trees)

Performance Measures
  • Accuracy = (tp + tn)/(tp + tn + fp + fn)
  • Precision (Pr) = tp/(tp + fp)
  • Recall (R) = tp/(tp + fn)
  • F-score = 2PrR/(Pr + R)
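
The four measures can be computed directly from the confusion-matrix counts. The counts below are made up, and returning 0.0 when a denominator is zero is a convention assumed for this sketch.

```python
def scores(tp, tn, fp, fn):
    """Return (accuracy, precision, recall, f_score) from the four counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f_score = 2 * precision * recall / (precision + recall)
    else:
        f_score = 0.0
    return accuracy, precision, recall, f_score

# Made-up counts: 1,000 records, only 50 of them positive.
accuracy, precision, recall, f_score = scores(tp=10, tn=940, fp=10, fn=40)
```

With these counts accuracy is 0.95 while recall is only 0.2, which is one reason accuracy alone can mislead when positive cases are rare.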

New Frontiers: Personal Health Information on the Web
  • Infodemiology studies the determinants and distribution of health information on the Internet (Gunther Eysenbach, 2004)
    • Google Trends
    • BioCaster
  • 19%-28.5% of all Internet users participate in online health-related discussions.
  • Growth of the Internet of Things is expected to significantly increase sharing of personal health information
    • Privacy protection has to be adjusted/re-developed

Personal Health Information
  • Personal health information (PHI) is information about one’s health discussed by a patient in a clinical setting
  • PHI is the most vulnerable private information posted online
    • I have a family history of Alzheimer's disease. I have seen what it does and its sadness is a part of my life. I am already burdened with the knowledge that I am at risk.
    • We're going for the basic blood tests, the NT scan, and the "Ashkenazi panel" since both XX and I are Jewish from E. European descent.

Privacy Protection in Big Data Analytics

Research Questions

Q1. Do people talk about health?

Q2. How do people talk about health?

Q3. What emotions can be found in health discussions?

Challenges of PHI Retrieval (Information Extraction)

General health information: they are promoting cancer awareness particularly lung cancer

Personal health information: I had a rare condition and half of my lung had to be removed

Irrelevant: I saw a guy chasing someone and screaming at the top of his lungs

Terminology: the transfer went well - my RE did it himself which was comforting. 2 embies (grade 1 but slow in development) so I am not holding my breath for a positive

Technical terms: Someone with 50 DB hearing aid gain with a total loss of 70 DB may not know that the place is producing 107 DB since it may not appear too loud to him since he only perceives 47 DB

Challenges of PHI Understanding (Semantic Analysis)


Challenges of Medical Electronic Resources
  • Electronic medical dictionaries are developed to analyze scientific publications
    • the Medical Dictionary for Regulatory Activities (MedDRA):

8,561 unique terms/86 PHI terms

    • the Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT):

44,802 unique terms/108 PHI terms


Our Approach
  • Humans in the loop – manual annotation of data samples (Supervised learning)
  • Advanced methods in data pre-processing
    • Sentence splitting, tokenization, part of speech tagging, lemmatization for nouns and verbs
  • PHI resource building (e.g., ontology of PHI terms, HealthAffect lexicon)
  • Use of robust algorithms
    • Naive Bayes
  • Appropriate evaluation methods
    • fn estimation
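
The first two pre-processing steps can be sketched with regular expressions. The rules below are deliberately crude assumptions for illustration (they would mis-split abbreviations such as "Dr."); part-of-speech tagging and lemmatization need a trained tool and are left out. The first example sentence comes from the forum samples in these slides, the second is made up.

```python
import re

def split_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tokenize(sentence):
    """Lowercase word tokens; internal hyphens and apostrophes are kept."""
    return re.findall(r"[a-z0-9]+(?:['-][a-z0-9]+)*", sentence.lower())

text = "For me the laser treatment had unpleasant side-effects. I fractured my nose!"
sentences = split_sentences(text)
tokens = [tokenize(s) for s in sentences]
```

The token lists are the structured form that a classifier such as Naive Bayes can then consume.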


Data Sources
  • Online medical forums
    • IVF
    • Hearing loss
    • Newborn screening for rare diseases
  • Social networks
    • MySpace
    • Twitter
    • Facebook

Q1. Do people talk about health?
  • In 1,000 randomly selected tweet threads, 15% of threads revealed personal health information
  • In 11,800 randomly selected MySpace posts, 6% of posts discussed personal health
  • On IVF forums, participants (95% women) mostly talk about health

Q1: It all depends on the context
  • On hearing loss (HL) forums, participants talk about health and quality of life/lifestyle
  • On forums about newborn screening for rare diseases, parents often discuss privacy and physical hurt but seldom talk about health
  • In a student network on Facebook, participants do NOT talk about health

Q2. How do people talk about health?
  • Simple language
    • For me the laser treatment had unpleasant side-effects.
    • …got a huge bump on my forehead, fractured my nose.
  • Basic concepts
    • Concussion, thyroid, asthma, fracture, hypothermia
    • Cold, flu, injury, headache
  • Exception: Hearing Loss discussions involve more specific terms than other discussions

Q3. What emotions can be found in health discussions?
  • Range of emotions depends on the content of health issues
    • Positive/negative/neutral on Twitter and HL forums
    • Gratitude, encouragement, endorsement, confusion on IVF forums
  • Strength of emotional disclosure varies
    • Outspoken emotional posts on newborn screening and IVF
    • Muted emotions on MySpace

Performance Evaluation
  • We detect PHI:
    • False negatives on social networks (11,800 messages): 0.003 (baseline 0.031)
    • False negatives on peer-to-peer networks (2,300 documents): 0.000 (baseline 0.031)
  • We recognize PHI:
    • Precision on Twitter (1,000 threads): 0.770 (baseline 0.419)
  • We identify PHI-related opinions:
    • F-score on HL forums (3,515 sentences): 0.685 (baseline 0.584)


Data Sets Used in Population Health Studies
  • Indian Liver Patient Dataset
  • Breast Cancer Wisconsin (Diagnostic) Data Set
  • Haberman's Survival Data Set (breast cancer, 1999)

  • Many more

Useful links
  • Weka 3: Data Mining Software – open source!

  • Support Vector Machine – open source!

  • Andrew Ng’s (Stanford) web site with video lectures on ML

  • Benchmark data sets repository

Thank you!


Probability-based: Naïve Bayes
  • Assumes that all the informative features are independent AND identically distributed.
  • Both assumptions are generally not true.
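
To show how the independence assumption is used, here is a small word-count Naive Bayes in plain Python. It is a sketch under assumptions, not the classifier used in the studies: the multinomial form and add-one smoothing are choices made for this example, and the four training sentences and their labels are made up.

```python
import math
from collections import Counter, defaultdict

def train_nb(examples):
    """examples: list of (tokens, label). Collect the counts used for prediction."""
    label_counts = Counter(label for _, label in examples)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in examples:
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return label_counts, word_counts, vocab

def predict_nb(model, tokens):
    """Pick the label with the highest log P(label) + sum of log P(word | label)."""
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        score = math.log(count / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in tokens:
            # Independence assumption: each word contributes its own factor.
            score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

examples = [
    ("i had my lung removed".split(), "PHI"),
    ("my thyroid test came back".split(), "PHI"),
    ("they promote cancer awareness".split(), "general"),
    ("experts fear the virus will mutate".split(), "general"),
]
model = train_nb(examples)
label = predict_nb(model, "my lung test".split())
```

On this toy data the message "my lung test" is labelled `PHI`, because each of its words was seen only in the PHI examples.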

Being Optimistic Does Not Hurt
  • Naïve Bayes can outperform sophisticated classifiers!

Prototype-based: K-nearest neighbor
  • Uses observations in the training set T closest in the input space to the entry x to form conclusion Y .
  • Y can be a predicted class label of x.
  • Useful in practical applications

A closer look at K neighbors

Labels for the test example:

2-NN: Green

3-NN: Green

4-NN: Ambiguous

5-NN: Red

6-NN: Red

7-NN: Red.
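
The way the label changes with K can be reproduced with a few lines of Python. The points below are made up and do not reproduce the figure from the slide; Euclidean distance and reporting a tied vote as "Ambiguous" are assumptions of this sketch.

```python
import math
from collections import Counter

def knn_label(train, x, k):
    """train: list of (point, label). Majority label among the k nearest
    points, or "Ambiguous" when the two leading labels are tied."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], x))[:k]
    votes = Counter(label for _, label in nearest).most_common()
    if len(votes) > 1 and votes[0][1] == votes[1][1]:
        return "Ambiguous"
    return votes[0][0]

# Made-up training points at distances 1..7 from the test example (0, 0).
train = [((1, 0), "Green"), ((0, 2), "Green"), ((3, 0), "Green"),
         ((0, 4), "Red"), ((5, 0), "Red"), ((0, 6), "Red"), ((7, 0), "Red")]
labels = {k: knn_label(train, (0, 0), k) for k in (1, 3, 6, 7)}
```

On these points the test example is Green for K = 1 and K = 3, ambiguous for K = 6 and Red for K = 7, so the choice of K decides the answer.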

Good/bad things about KNN
  • Only two adjustable parameters:
    • Number of neighbors
    • Closeness (i.e., distance between neighbors)
  • The output is easy to understand
  • Depends heavily on the training data, i.e., the population sample

Optimization-based algorithms: Support Vector Machines
  • Highly accurate classifiers
  • Extremely popular for publications
  • Seldom used in practice

Support Vector Machines


  • Hyper-planes in action:
    • various dimensions
    • linear hyper-planes differ by soft margins

Good/bad things about SVM
  • Several adjustable parameters
    • Dimensions of discriminative hyper-planes
    • Kernel functions
    • Soft-margin
  • Every parameter matters
    • Almost a random choice
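
Why the soft-margin parameter matters can be seen from the commonly used linear soft-margin objective, 0.5·||w||² + C·(sum of hinge losses). The sketch below only evaluates that objective for two candidate hyper-planes; it does not train an SVM. The data, the two candidates and the values of `C` are made up.

```python
def soft_margin_objective(w, b, points, C):
    """0.5 * ||w||^2 + C * sum of hinge losses; labels are +1 or -1."""
    margin_term = 0.5 * sum(wi * wi for wi in w)
    hinge = sum(
        max(0.0, 1 - y * (sum(wi * xi for wi, xi in zip(w, x)) + b))
        for x, y in points
    )
    return margin_term + C * hinge

# Made-up 1-D data with one positive point close to the boundary.
points = [((2.0,), 1), ((0.5,), 1), ((-2.0,), -1)]
candidates = {
    "wide": ((0.5,), 0.0),    # wide margin, one margin violation
    "narrow": ((2.0,), 0.0),  # narrow margin, no violations
}

preferred = {}
for C in (1.0, 10.0):
    objective = {name: soft_margin_objective(w, b, points, C)
                 for name, (w, b) in candidates.items()}
    preferred[C] = min(objective, key=objective.get)
```

With `C = 1.0` the wide-margin hyper-plane has the lower objective, with `C = 10.0` the narrow one wins, so the same data yields a different classifier depending on this one parameter.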

Decision-based algorithms
  • Decision Trees
  • Decision Lists

Can beat SVMs when efficiency is as important as effectiveness!
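
A decision list is cheap to apply because it is just an ordered set of if-then rules where the first matching rule wins. The rules, field names and thresholds below are hypothetical and only borrow the category names (No, <30, >30) from the classification example earlier in the slides.

```python
def classify(record, rules, default):
    """Return the label of the first rule whose condition holds."""
    for condition, label in rules:
        if condition(record):
            return label
    return default

# Hypothetical rules, for illustration only.
rules = [
    (lambda r: r["prior_admissions"] == 0, "No"),
    (lambda r: r["days_since_discharge"] < 30, "<30"),
]

first = classify({"prior_admissions": 0, "days_since_discharge": 5}, rules, ">30")
second = classify({"prior_admissions": 2, "days_since_discharge": 5}, rules, ">30")
third = classify({"prior_admissions": 2, "days_since_discharge": 90}, rules, ">30")
```

Classifying a record costs at most one pass over the rules, and the rule that fired doubles as an explanation of the result.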