Identifying Unproven Cancer Treatments on the Health Web: Addressing Accuracy, Generalizability, and Scalability

Center for Health Informatics and Bioinformatics

NYU Langone Medical Center

Yin Aphinyanaphongs MD, PhD

Lawrence Fu PhD

Constantin Aliferis MD, PhD

MEDINFO 2013, Copenhagen, Denmark, August 19, 2013

Our Vision: A Safer Health Web
  • We envision a health web that provides safe, reliable, validated medical information to health consumers.
  • We focus on the lowest-quality websites, building classifiers to identify them.
  • Our initial pilot targeted unproven cancer treatments.


Cancer, Health Consumers, and the Health Web
  • In the US, 65% of cancer patients searched for unproven treatments, and 12% purchased at least one.
  • 83% of cancer patients used at least one unproven treatment.
  • Patients with a dire prognosis are more susceptible to using unproven treatments.
  • Routine cancer search topics return sites offering non-conventional treatments.
  • These sites are of variable quality: 23% discouraged the use of conventional medicine, 15% discouraged adherence to physician advice, and 26% provided anecdotal experiences.
  • Schmidt and Ernst found that nearly 10% of their sample contained potentially or definitely harmful material.
  • Impact of bad advice:
    • Death
    • Financial costs with no benefit
    • Reduced efficacy of known treatments
    • Delayed access to proven therapies


How Is the Health Web Policed Now?
  • Manual efforts
    • Rating agencies (HONCode)
    • Self ratings
    • Operation Cure.all
      • An effort led by the Federal Trade Commission in 1997 and 1999, in collaboration with 80 agencies and organizations from 25 countries.
      • Identified approximately 800 to 1,200 websites purporting to cure a variety of diseases.


How Do We Get to a Safer Health Web?
  • Automation
  • In our prior work, we built models that successfully identified known unproven cancer treatments on the Internet (where an unproven treatment is known a priori and the goal is to identify new websites that may market it; Area Under the Curve = 0.93).
  • Some questions left to consider:
    • What happens with unknown unproven treatments (i.e., treatments that are not known a priori)? Is it possible to build models that will identify these unknown unproven treatments?
    • What methods may scale a high-performing machine learning model to billions of web pages?
  • This paper addresses these limitations of the prior work.


Where May Machine Learning Generalize and Scale?

1. Preprocessing
2. Generalizability
3. Model Building
4. Model


Solution 1: MapReduce for Pre-processing

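As a minimal illustration of the MapReduce split for text pre-processing (not the authors' actual pipeline), the classic term-count job can be sketched in plain Python. The names `map_phase` and `reduce_phase` are hypothetical; in a real deployment (e.g., Hadoop) these functions run distributed across many workers:

```python
from collections import Counter
from itertools import chain

def map_phase(page_text):
    # Map step: emit (term, 1) pairs for each token on one page.
    # In a real MapReduce job this runs in parallel across workers.
    return [(token, 1) for token in page_text.lower().split()]

def reduce_phase(pairs):
    # Reduce step: sum the counts emitted for each term.
    counts = Counter()
    for term, count in pairs:
        counts[term] += count
    return counts

# Toy corpus standing in for crawled health-web pages.
pages = ["shark cartilage cures cancer", "shark cartilage cancer claims"]
term_counts = reduce_phase(chain.from_iterable(map_phase(p) for p in pages))
```

Because the map output is just key/value pairs, the same sketch extends to heavier pre-processing steps (tokenization, stemming, TF-IDF statistics) by changing what the mapper emits.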

Solution 2: Feature Selection
  • Feature selection identifies a subset of relevant features for use in model construction, ideally maintaining or improving predictivity relative to the full feature set.
  • Generalized Local Learning Parents-Children (GLL-PC) is one such algorithm that learns a subset of features for use in model construction.
    • It will identify the smallest subset of features that gives optimal classification performance under several assumptions.
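GLL-PC itself relies on conditional-independence tests and is beyond a short sketch. As a much simpler stand-in for the general idea of keeping a small predictive subset, here is a crude univariate mean-difference filter; the function name and data layout are hypothetical, and this is explicitly not GLL-PC:

```python
def select_top_features(X, y, k):
    # Crude univariate filter (NOT GLL-PC): score each feature by the
    # absolute difference of its class means and keep the top k.
    pos = [x for x, label in zip(X, y) if label == 1]
    neg = [x for x, label in zip(X, y) if label == 0]

    def score(j):
        mean_pos = sum(x[j] for x in pos) / len(pos)
        mean_neg = sum(x[j] for x in neg) / len(neg)
        return abs(mean_pos - mean_neg)

    ranked = sorted(range(len(X[0])), key=score, reverse=True)
    return sorted(ranked[:k])

# Feature 0 is uninformative noise; feature 1 separates the classes.
X = [[0.5, 1.0], [0.5, 0.9], [0.5, 0.1], [0.5, 0.0]]
y = [1, 1, 0, 0]
selected = select_top_features(X, y, k=1)
```

The payoff either way is the same: with far fewer features, both model building and model application become cheap enough to scale.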

Image from: Causal graph-based analysis of genome-wide association data in rheumatoid arthritis.

Area Under the Curve

Centor RM. The Use of ROC Curves and Their Analyses. Med Decis Making 1991;11(2):102-106.

  • Receiver Operating Characteristic (ROC) curves
  • Area Under the Curve (AUC)
Area Under the Curve as a Measure of Performance

  • Perfect separation: Area Under the Curve ~ 1.0
  • Okay separation: Area Under the Curve ~ 0.8
  • Random separation: Area Under the Curve ~ 0.5
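The AUC values above can be computed directly from classifier scores via the Mann-Whitney formulation: the probability that a randomly chosen positive example is scored above a randomly chosen negative one. A minimal sketch:

```python
def auc(scores, labels):
    # AUC as the Mann-Whitney statistic: the fraction of
    # positive/negative pairs where the positive example receives
    # the higher score (ties count as half a win).
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Perfect separation (every positive outscores every negative) gives 1.0; scores carrying no signal hover near 0.5.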

Gold Standard
  • 8 quack treatments identified by
  • Queried Google for each treatment, appending "cancer" and "treatment."
  • Top 30 results for each treatment were labeled by the authors.
  • Two authors labeled 191 out of 240 web pages as making unproven claims or not (inter-rater reliability: kappa = 0.76).
  • Excluded:
    • not-found (404 response code) error pages
    • pages with no content
    • non-English pages
    • password-protected pages
    • PDF pages
    • redirect pages
    • pages where the actual treatment text does not appear in the document
Build 8 Independent Training/Testing Sets


Experimental Design
  • 8-fold leave-one-treatment-out cross-validation
  • Classifiers
    • Linear Support Vector Machines
    • L1-regularized Logistic Regression
    • L2-regularized Logistic Regression
    • Classifiers optimized over cost and regularization coefficients, respectively.
  • Text pre-processed with a term frequency-inverse document frequency (TF-IDF) weighting scheme.
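Leave-one-treatment-out cross-validation differs from random k-fold splitting: every page about the held-out treatment leaves the training set together, so performance estimates reflect generalization to treatments never seen in training. A minimal sketch, where the `(treatment, page)` record shape is an assumption for illustration:

```python
def leave_one_treatment_out(samples):
    # samples: list of (treatment, page) records. Yields one
    # train/test split per treatment, holding out all of that
    # treatment's pages at once.
    treatments = sorted({t for t, _ in samples})
    for held_out in treatments:
        train = [s for s in samples if s[0] != held_out]
        test = [s for s in samples if s[0] == held_out]
        yield held_out, train, test

samples = [("laetrile", "page1"), ("laetrile", "page2"),
           ("shark cartilage", "page3"), ("hydrazine", "page4")]
folds = list(leave_one_treatment_out(samples))
```

With 8 treatments in the gold standard, this yields 8 folds, each simulating an unknown unproven treatment.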


Results – Feature Selection

* These performances are calculated using 8-fold nested cross-validation; example documents are held out randomly, without regard to the selected treatment.

Computational Performance Point Estimates


Future Work
  • Larger sample size.
  • Address building classification models in other languages.
  • Improved computational performance estimates (intervals rather than point estimates).
  • Address the adversarial nature of this application.
  • Further explore labeled sample size versus the ability to identify low-quality websites on the full web.
  • Address efficient collection of web pages.


  • Generalization to unknown unproven treatments.
  • Evidence that training in a production environment will need only a small number of labeled documents.
  • Scalability.
    • MapReduce for pre-processing.
    • Feature selection
      • Speeds up model building.
      • Speeds up model application.


References
  • Y. Aphinyanaphongs and C. Aliferis, "Text categorization models for identifying unproven cancer treatments on the web," presented at MEDINFO 2007.
  • J. Metz, "A multi-institutional study of Internet utilization by radiation oncology patients," International Journal of Radiation Oncology Biology Physics, vol. 56, no. 4, pp. 1201–1205, Jul. 2003.
  • M. Richardson, T. Sanders, and J. Palmer, "Complementary/alternative medicine use in a comprehensive cancer center and the implications for oncology," Journal of Clinical Oncology, 2000.
  • K. Schmidt, "Assessing websites on complementary and alternative medicine for cancer," Annals of Oncology, vol. 15, no. 5, pp. 733–742, May 2004.
  • D. Y. Kim, H. R. Lee, and E. M. Nam, "Assessing cancer treatment related information online: unintended retrieval of complementary and alternative medicine web sites," European Journal of Cancer Care, vol. 18, no. 1, pp. 64–68, Jan. 2009.
  • M. Hainer, N. Tsai, and S. Komura, "Fatal hepatorenal failure associated with hydrazine sulfate," Annals of internal …, 2000.
  • J. Bromley, "Life-Threatening Interaction Between Complementary Medicines: Cyanide Toxicity Following Ingestion of Amygdalin and Vitamin C," Annals of Pharmacotherapy, vol. 39, no. 9, pp. 1566–1569, Aug. 2005.
  • A. Sparreboom, "Herbal Remedies in the United States: Potential Adverse Interactions With Anticancer Agents," Journal of Clinical Oncology, vol. 22, no. 12, pp. 2489–2503, Jun. 2004.
  • "'Operation Cure.all' Targets Internet Health Fraud," Federal Trade Commission, 24-Jun-1999.
  • J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Commun. ACM, vol. 51, no. 1, pp. 107–113, 2008.
  • C. F. Aliferis, A. Statnikov, I. Tsamardinos, S. Mani, and X. D. Koutsoukos, "Local causal and Markov blanket induction for causal discovery and feature selection for classification part I: Algorithms and empirical evaluation," The Journal of Machine Learning Research, vol. 11, pp. 171–234, 2010.


Thank you.
  • NYU Langone Medical Center.
  • Center for Health Informatics and Bioinformatics.
  • Center for Translational Sciences Institute
  • Grant 1UL1RR029893 from the National Center for Research Resources, National Institutes of Health
  • Collaborators
    • Lawrence Fu
    • Constantin Aliferis
  • Contact: Yin Aphinyanaphongs (
