SMC’13
SMC’13. “Big Data to Knowledge” in the Health Sciences: The Application and Value of Cancer Infodemiology Georgia Tourassi, PhD. 2013 Smoky Mountains CSE Conference Gatlinburg, TN September 5, 2013 . Environmental Cancer Risk and Migration Pattern. PIs: Georgia Tourassi / Songhua Xu.


SMC’13

“Big Data to Knowledge” in the Health Sciences: The Application and Value of Cancer Infodemiology

Georgia Tourassi, PhD

2013 Smoky Mountains CSE Conference, Gatlinburg, TN

September 5, 2013


Environmental Cancer Risk and Migration Pattern

PIs: Georgia Tourassi / Songhua Xu



Infodemiology

  • “The epidemiology of digital (mis)information”

  • “The Internet has made measurable what was previously immeasurable: The distribution of health information in a population, tracking (in real time) health information trends over time, and identifying gaps between information supply and demand.”

    • G Eysenbach, Am J Med 2002


Infodemiology in Action

http://www.google.org/flutrends/video/GoogleFluTrends_USFluActivity.mov


Application Areas

  • Detecting and quantifying disparities in information availability

  • Monitoring public-health-relevant publications on the Internet

  • Tracking the effectiveness of health marketing campaigns

  • Monitoring health-related behaviors

  • Syndromic surveillance

  • Unknown drug side effects and complications

  • ….


Social Media Use among Internet Users

Chou, WS et al. 2009. Social Media Use in the US: Implications for health communication, J Med Internet Res, 11(4): e48.



Cancer Community

  • One in five internet users with cancer

  • A growing number of cancer patients share online

    • their personal stories regarding their symptoms, treatments, emotional and physical concerns, and many other issues arising throughout the cancer diagnosis, treatment, and recovery phases.

  • Promising potential of knowledge discovery via analyzing user generated content in online cancer communities.


CASE STUDY 1

Parity and Breast Cancer Risk


Case-Control Study

[Diagram: a population is divided into cases with breast cancer and controls without breast cancer; each group is split into Childbirth and No Childbirth for knowledge discovery.]


Conventional Data Collection

  • Hospitals

  • Organizations

  • Institutes


Proposed Data Collection

  • Obituaries

  • On-Line Obituaries


Web Crawling and Text Parsing

  • Web Crawler

  • Local Newspaper Websites


Information Retrieval - Age


Information Retrieval - Gender


Information Retrieval - Childbirth


Information Retrieval - Cause of Death


Data Collection

  • Obituaries published online 2000-2012

    • 59,002 w/ “breast cancer”

    • 50,927 w/out “breast cancer”

  • After “cleaning”

    • 20,332 case group

      • 15,946 w/ at least one biological child

    • 15,954 control group

      • 13,548 w/ at least one biological child


Case and Control Groups

  • Case group: 20,332 women, 15,946 with children (78.4%)

  • Control group: 15,954 women, 13,548 with children (84.9%)
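As a quick arithmetic check, the percentages on this slide follow directly from the case and control counts. The short sketch below recomputes them; the function name is only illustrative and does not come from the presentation.

```python
def share_with_children(with_children, total):
    """Percentage of a group that has at least one biological child."""
    return 100.0 * with_children / total

# Counts taken from the slide: case group and control group after cleaning.
case_pct = share_with_children(15946, 20332)
control_pct = share_with_children(13548, 15954)

print(f"case group: {case_pct:.1f}%  control group: {control_pct:.1f}%")
```

Both values round to the figures shown on the slide.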


Childbirth Incidence


Odds Ratio

  • Ages from 30-69 Years Old

  • Age-Adjusted by 2010 US Standard Population

  • Odds of Childbirth Incidence in the Case Group:

    • 13643 / 4284 = 3.2

  • Odds of Childbirth Incidence in the Control Group:

    • 6556 / 1545 = 4.2

  • Odds (of Childbirth Incidence) Ratio = 0.74, CI:(0.69,0.79)
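The odds and their ratio can be recomputed from the four counts on this slide. The sketch below is an illustration only: it assumes the first number in each pair counts women with childbirth and the second counts women without, and it uses the standard log-based (Woolf) confidence interval, which the transcript does not confirm as the authors' method.

```python
import math

def odds_ratio_with_ci(a, b, c, d, z=1.96):
    """Crude odds ratio with a log-based (Woolf) confidence interval.

    a, b: exposed / unexposed counts in the case group
    c, d: exposed / unexposed counts in the control group
    """
    odds_case = a / b
    odds_control = c / d
    ratio = odds_case / odds_control
    # Standard error of log(odds ratio) under the Woolf approximation.
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(ratio) - z * se)
    upper = math.exp(math.log(ratio) + z * se)
    return odds_case, odds_control, ratio, lower, upper

# Counts as shown on the slide (described there as age-adjusted).
odds_case, odds_control, ratio, lower, upper = odds_ratio_with_ci(13643, 4284, 6556, 1545)
print(f"{odds_case:.1f} {odds_control:.1f} {ratio:.2f} ({lower:.2f}, {upper:.2f})")
```

With these counts the crude calculation gives odds of about 3.2 and 4.2 and a ratio near 0.75, close to but not identical to the reported 0.74. The transcript does not say how the reported ratio and interval were obtained.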


Reliability?


Number of Children & Breast Cancer Risk


Sample Size


Discussion

  • Limitation of obituaries

    • Cannot derive effect of additional factors (e.g., age at first pregnancy, breastfeeding, lifestyle choices)

  • Other types of online patients’ personal life stories can overcome these limitations


CASE STUDY 2

Geospatial Cancer Mortality Trends in the US


Web Mining for Deriving Geospatial Cancer Mortality Trends in the US

  • Collecting, compiling, and reporting the related surveillance statistics is a time-consuming process that introduces substantial delays in monitoring.

    • We propose to study whether general cancer mortality trends can be adequately captured by automated analysis of text content found in online obituaries published in US newspapers.


Method Overview

  • We implemented an obituary crawler to collect a large number of obituaries from online local newspapers.

  • We implemented a rule-based natural language system to transform the collected obituary documents into a structured format.

  • We applied two correction factors to account for anticipated biases in the statistics derived from the collected dataset.

  • We compared statistics derived from the collected obituary dataset with the cancer mortality statistics published by SEER to evaluate the accuracy of the obituary-derived reports.
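To make the rule-based step concrete, here is a minimal sketch of how free-text obituaries might be turned into structured fields. Everything in it is an assumption for illustration: the keyword sets, the age pattern, and the function name are hypothetical, and the presentation does not list the contents of its dictionaries or its actual rules.

```python
import re

# Hypothetical keyword dictionaries; the real dictionary contents are not given.
FEMALE_TERMS = {"she", "her", "mother", "wife", "daughter", "sister"}
MALE_TERMS = {"he", "his", "father", "husband", "son", "brother"}
CANCER_TERMS = {"breast_cancer": "breast cancer", "lung_cancer": "lung cancer"}

# Hypothetical rule: "age 67", "aged 67", or "67 years old".
AGE_PATTERN = re.compile(
    r"\b(?:age|aged)\s+(\d{1,3})\b|\b(\d{1,3})\s+years\s+old\b",
    re.IGNORECASE,
)

def extract_fields(text):
    """Return age, gender, and cancer mentions inferred from one obituary."""
    lowered = text.lower()
    tokens = re.findall(r"[a-z]+", lowered)

    match = AGE_PATTERN.search(text)
    age = int(match.group(1) or match.group(2)) if match else None

    # Majority vote over gendered keywords; ties are left undecided.
    female = sum(t in FEMALE_TERMS for t in tokens)
    male = sum(t in MALE_TERMS for t in tokens)
    gender = "female" if female > male else "male" if male > female else None

    cancers = [key for key, phrase in CANCER_TERMS.items() if phrase in lowered]
    return {"age_at_death": age, "gender": gender, "cancers": cancers}

sample = ("Jane Doe, age 67, passed away after a long battle with breast cancer. "
          "She is survived by her husband and two sons.")
record = extract_fields(sample)
print(record)
```

A real system would need many more rules (for example, to separate the deceased's cause of death from a relative's illness), which is one reason a cleaning step follows extraction.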


System Architecture

[Architecture diagram. Labels recoverable from the figure:]

  • Stages: Web Crawling, Pre-processing, Natural Language Processing, Database Processing, Data Analyzing

  • Components: Web, Sequential Crawler, Random Crawler, Dictionary, Html Documents, Metadata Reference, Content Extraction Module, Context Extraction Module, Extracted Context, Obituary Content, Rule-based Context Inference Module, Context Enriching Module, Enriched Context, Exact Dictionary-based Chunking Module, Context Integration Module, Rule-processing Module, Integrated Context, Rule, RDBMS, Raw Database, Database Module, Inferred Metadata, ID Assignment Module, Cleansed Database, Data-cleansing Module, Statistical Analysis Module, Statistics Report

  • Dictionaries and fields: rdm_sm, rdm_lg, kwd_bc, kwd_lc, dic_age, dic_year, dic_gender_male, dic_gender_female, newspaper_reference, age_at_death, year_of_death, gender, breast_cancer/lung_cancer


Data Collection

  • Obituary Crawler

    • Based on an online obituary search engine, ObitFinder

    • Serviced by Legacy.com, one of the largest online obituary providers for the US newspaper industry

    • 1,100 newspapers, 2005-2009 (200+ GB)

    • Covering 46 US states (AR, ND, WV, HI, WY excluded)

  • Data

    • Random selection

    • 3,572,122 online obituary articles


Data Analysis

  • Anticipated Biases

    • The number of cancer-related obituaries could be biased due to different prevalence of obituaries for different age groups or states

    • The proportion of obituaries including cause of death could bias the number of cancer-related obituaries

  • Correction Factors

    • Referencing the statistics from the CDC Deaths Final Report (2005-2009)

    • Incorporating cultural “openness” factor of a particular age group or a state


Correction Factors

  • Adjustment Ratio 1 (Age-based Obituary Distribution across States)

    • Age-based obituary distribution over states may be different from age-based death distribution over states

    • E.g., In the case of Tennessee, [#Obituary(TN)/#Obituary(US)] is 0.86%, but [#Death(TN)/#Death(US)] is 2.43%

    • Adjustment Ratio 1: for TN is 2.43/0.86 = 2.84

    • We can compute adjustment ratio 1 for each state

  • Adjustment Ratio 2 (Obituary Content Richness)

    • The proportion of obituaries that include a cause of death may differ from state to state

    • http://en.wikipedia.org/wiki/List_of_causes_of_death_by_rate

    • E.g., In the case of California, 20.7 % of obituaries include cause-of-death related terms; however, only 5.2 % of Alabama obituaries include cause-of-death related terms.

    • Adjustment Ratio 2: for CA is 5.2/20.7 = 0.13

    • We can compute adjustment ratio 2 for each state
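Adjustment Ratio 1 is described above as a state's share of US deaths divided by its share of collected obituaries. The sketch below reproduces the Tennessee example under that reading; the function name is hypothetical, and how the ratio is then applied to counts is not specified in the transcript.

```python
def adjustment_ratio_1(obituary_share_pct, death_share_pct):
    """Share of US deaths divided by share of collected obituaries, for one state."""
    return death_share_pct / obituary_share_pct

# Tennessee figures from the slide: 0.86% of obituaries, 2.43% of deaths.
tn_ratio = adjustment_ratio_1(obituary_share_pct=0.86, death_share_pct=2.43)
print(f"TN adjustment ratio 1: {tn_ratio:.2f}")
```

From the rounded percentages this gives about 2.83, slightly below the 2.84 shown on the slide; the slide value may have been computed from unrounded shares.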


Case Study 1: Breast Cancer (6,935 female subjects)


Reliability?


Case Study 1: Lung Cancer (5,312 subjects)


Reliability?


Conclusions

  • Cancer mortality trends can be captured reliably in a time-efficient, cost-effective, and fully automated way by mining content that is openly available on the Internet.

  • Using breast and lung cancer as case studies, we observed that the trends discovered via web mining were very similar to those reported by NCI.

  • Proposed correction factors are useful to account for anticipated biases of statistics from obituary datasets.


Summary


Conclusion


Thank you

Georgia Tourassi, PhD

Songhua Xu, PhD

