
Data, Databases, and Discovery

Data, Databases, and Discovery. Andy Novobilski, PhD, UT Chattanooga Computer Science. Nuts and Bolts Research Methods Symposium, UT College of Medicine Chattanooga, September 29, 2006. An Introduction to Knowledge Discovery: Data Collection, Data Validation, Preprocessing of Data.



Presentation Transcript


  1. Data, Databases, and Discovery • Andy Novobilski, PhD • UT Chattanooga Computer Science • Nuts and Bolts Research Methods Symposium • UT College of Medicine Chattanooga • September 29, 2006

  2. An Introduction to Knowledge Discovery • Data Collection • Data Validation • Preprocessing of Data • Mining the Data • Comparing Methods

  3. Data Collection … • Paper or Electronic? • Fingernet • Continuous or Discrete? • And the Understatement of the Year … • Health Insurance Portability and Accountability Act of 1996 (HIPAA) • The HIPAA website http://www.hipaa.org/ links to the government's website http://aspe.hhs.gov/admnsimp/ which states "Administrative Simplification in the Health Care Industry"

  4. … And Raw Storage … • Alphanumeric Data • Excel Worksheets • Comma/Tab Delimited Text Files • XML: The Extensible Markup Language • http://www.xml.com/ • Binary Data • Images • GIF, BMP, EPS • Streaming Data • HL7 - http://www.hl7.org/ (http://en.wikipedia.org/wiki/HL7) • DICOM - http://medical.nema.org/

  5. … Stored in a Relational Manner • Relational Databases • Inexpensive • MS Access • Expensive • MS SQL Server, Oracle, Sybase, … • Free (sort of … open source) • MySQL, PostgreSQL • Licensing Varies by Usage
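
As a rough illustration of what "stored in a relational manner" means, the sketch below uses Python's built-in sqlite3 module. SQLite is only a stand-in here (it is not one of the products named on the slide), and the table names, column names, and rows are all hypothetical.

```python
import sqlite3

# In-memory database; SQLite is an assumed stand-in for the systems on the slide.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patient (id TEXT PRIMARY KEY, gender TEXT, age INTEGER)")
conn.execute(
    "CREATE TABLE visit ("
    " patient_id TEXT REFERENCES patient(id),"
    " visit_date TEXT,"
    " temp_c REAL)"
)

# Hypothetical rows, invented purely for illustration.
conn.executemany(
    "INSERT INTO patient VALUES (?, ?, ?)",
    [("001", "F", 54), ("002", "M", 61)],
)
conn.executemany(
    "INSERT INTO visit VALUES (?, ?, ?)",
    [
        ("001", "2006-09-01", 36.8),
        ("002", "2006-09-02", 37.1),
        ("002", "2006-09-20", 38.4),
    ],
)

# Relate the two tables: one row per patient with that patient's visit count.
rows = conn.execute(
    "SELECT p.id, p.gender, COUNT(v.visit_date)"
    " FROM patient p LEFT JOIN visit v ON v.patient_id = p.id"
    " GROUP BY p.id, p.gender ORDER BY p.id"
).fetchall()
print(rows)
```

Keeping patients and visits in separate tables linked by a key is the design choice that distinguishes this from a single flat worksheet.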

  6. Data Validation • Patient 002 is a … • Pregnant Male ( hit the 9 instead of 0) • With Ice Water in His Veins (misplaced decimal) • Who Might or Might Not Smoke (missing data)
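
The three problems with Patient 002 can be caught by simple rule checks. The sketch below is one possible way to write them; the field names, the pregnancy coding (0 meaning "not pregnant"), and the plausible temperature range are assumptions made for illustration, not details from the talk.

```python
# Assumed plausible body temperature range in degrees Celsius (not from the slides).
PLAUSIBLE_TEMP_C = (30.0, 43.0)

def validate(record):
    """Return a list of human-readable problems found in one patient record."""
    problems = []
    # Keying error: a male patient with a non-zero pregnancy code.
    if record.get("gender") == "M" and record.get("pregnant") not in (0, None):
        problems.append("male patient with pregnancy code %r" % record["pregnant"])
    # Misplaced decimal: a value far outside the plausible range.
    temp = record.get("temp_c")
    if temp is None:
        problems.append("missing temp_c")
    elif not PLAUSIBLE_TEMP_C[0] <= temp <= PLAUSIBLE_TEMP_C[1]:
        problems.append("temp_c out of range: %r" % temp)
    # Missing data: smoking status was never recorded.
    if record.get("smoker") is None:
        problems.append("missing smoker")
    return problems

# Hypothetical records mirroring the slide's example.
patient_001 = {"id": "001", "gender": "F", "pregnant": 0, "temp_c": 36.8, "smoker": "N"}
patient_002 = {"id": "002", "gender": "M", "pregnant": 9, "temp_c": 3.71, "smoker": None}

issues = validate(patient_002)
for issue in issues:
    print("Patient 002:", issue)
```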

  7. Preprocessing the Data • Clean-up • Out of Scope vs. Out of Family • Feature Extraction • Data Aggregation • Feature Transformation • Normalization • Principal Component Analysis
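
Two of the transformations listed above, normalization and principal component analysis, can be sketched in a few lines. This is a minimal pure-Python version under stated assumptions: z-score scaling is used as the normalization, the first principal component is approximated by power iteration rather than a full eigendecomposition, no column is constant, and the age and blood pressure values are invented.

```python
import math

def zscore_columns(rows):
    """Rescale each column to mean 0 and population standard deviation 1."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((x - m) ** 2 for x in c) / len(c)) for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]

def first_principal_component(rows, iterations=200):
    """Approximate the leading eigenvector of the covariance matrix by power iteration."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(r[i] * r[j] for r in centered) / n for j in range(d)] for i in range(d)]
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Hypothetical (age, systolic blood pressure) pairs.
raw = [[40, 120], [50, 130], [60, 145], [70, 150], [80, 170]]
scaled = zscore_columns(raw)
pc1 = first_principal_component(scaled)
print(pc1)
```

Scaling first matters because PCA is sensitive to units: without it, the column with the larger numeric range would dominate the component.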

  8. Turning Data into Information • Data Mining … • Clustering • Decision Trees • Neural Networks • Bayesian Networks

  9. Clustering • K-Means • [Figure: K-means clustering of cases labeled Y and N]
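
For readers unfamiliar with K-means, the following is a minimal sketch of the usual assign-then-update loop. The points are invented, and the initial centroids are simply the first k points so that the run is repeatable; practical implementations usually choose starting centroids more carefully.

```python
def kmeans(points, k, iterations=20):
    """Cluster points into k groups; returns (centroids, labels)."""
    centroids = [list(p) for p in points[:k]]  # deterministic start: first k points

    def nearest(p):
        dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
        return dists.index(min(dists))

    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p)].append(p)
        # Update step: each centroid moves to the mean of its members.
        for i, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]

    labels = [nearest(p) for p in points]
    return centroids, labels

# Hypothetical two-dimensional measurements forming two visible groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, labels = kmeans(points, k=2)
print(labels)
```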

  10. Decision Trees • Division of Data Based on Information Gain • White Box • [Figure: decision tree splitting on Gender (M/F), Smoker (N/Y), and Age, with Y/N leaves]
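
Information gain, the splitting criterion named above, is the drop in entropy of the outcome after dividing the data on an attribute. The sketch below computes it for a tiny invented data set; the attribute names echo the slide's figure, but the rows are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Entropy of the target minus the weighted entropy after splitting on attribute."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in set(r[attribute] for r in rows):
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

# Hypothetical cases: outcome follows smoking status exactly and ignores gender.
rows = [
    {"gender": "M", "smoker": "Y", "outcome": "Y"},
    {"gender": "M", "smoker": "N", "outcome": "N"},
    {"gender": "F", "smoker": "Y", "outcome": "Y"},
    {"gender": "F", "smoker": "N", "outcome": "N"},
]
gain_smoker = information_gain(rows, "smoker", "outcome")
gain_gender = information_gain(rows, "gender", "outcome")
print(gain_smoker, gain_gender)
```

A tree builder would split on the attribute with the highest gain (here, smoker) and then repeat the calculation within each branch.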

  11. Neural Networks • Functional Approximation to Data • Black Box • Most Common is Feed Forward, Back Propagation • Considerations in Training the Network • Many Types of Neural Networks • Difficulties with Discrete Data • Missing Data Requires Careful Consideration • [Figure: network mapping Case Data to a Forecast]
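
To make "feed forward, back propagation" concrete, here is a small sketch of a network with one hidden layer trained by per-example gradient descent. Everything specific in it is an assumption chosen for illustration: the layer sizes, the sigmoid activation, the cross-entropy loss, the learning rate, and the toy data (a logical OR of two inputs).

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class TinyNet:
    """One hidden layer, one sigmoid output; each weight row ends with a bias weight."""

    def __init__(self, n_in, n_hidden, seed=0):
        rng = random.Random(seed)
        self.w_hidden = [[rng.uniform(-1, 1) for _ in range(n_in + 1)] for _ in range(n_hidden)]
        self.w_out = [rng.uniform(-1, 1) for _ in range(n_hidden + 1)]

    def forward(self, x):
        # Feed forward: inputs -> hidden activations -> output probability.
        hidden = [sigmoid(sum(w * v for w, v in zip(row, x + [1.0]))) for row in self.w_hidden]
        output = sigmoid(sum(w * v for w, v in zip(self.w_out, hidden + [1.0])))
        return hidden, output

    def train_step(self, x, target, rate):
        hidden, output = self.forward(x)
        # Back propagation: output error first, then the share owed by each hidden unit.
        delta_out = output - target  # cross-entropy gradient at the output pre-activation
        delta_hidden = [delta_out * self.w_out[j] * h * (1.0 - h) for j, h in enumerate(hidden)]
        for j, v in enumerate(hidden + [1.0]):
            self.w_out[j] -= rate * delta_out * v
        for j, row in enumerate(self.w_hidden):
            for i, v in enumerate(x + [1.0]):
                row[i] -= rate * delta_hidden[j] * v

def mean_loss(net, data):
    """Average cross-entropy loss over the data set."""
    total = 0.0
    for x, y in data:
        _, p = net.forward(x)
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # guard against log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(data)

# Toy data: logical OR of two inputs (hypothetical, chosen because it is easy to learn).
data = [([0.0, 0.0], 0), ([0.0, 1.0], 1), ([1.0, 0.0], 1), ([1.0, 1.0], 1)]
net = TinyNet(n_in=2, n_hidden=3)
initial_loss = mean_loss(net, data)
for _ in range(1000):
    for x, y in data:
        net.train_step(x, y, rate=0.5)
final_loss = mean_loss(net, data)
print(initial_loss, final_loss)
```

The learned weights have no direct interpretation, which is what the slide means by "black box".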

  12. Bayesian Networks • Belief Networks • White Box • Causal Orientation • Beliefs are Updated Based on New Information • Nodes Can Serve as Both Evidence and Query Points • Handles Missing Data Gracefully
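
The way a belief network updates on new evidence, and copes with evidence that is missing, can be sketched with inference by enumeration on a three-node network. The structure (a disease node with two test nodes as children) and every probability below are hypothetical; this is not the network from the talk.

```python
from itertools import product

VARIABLES = ["disease", "test_a", "test_b"]

# Hypothetical parameters, invented for illustration.
P_DISEASE = 0.1
P_TEST_A = {True: 0.9, False: 0.2}  # P(test_a positive | disease present / absent)
P_TEST_B = {True: 0.8, False: 0.1}  # P(test_b positive | disease present / absent)

def joint(assignment):
    """Probability of one complete assignment of all three variables."""
    d = assignment["disease"]
    p = P_DISEASE if d else 1 - P_DISEASE
    p *= P_TEST_A[d] if assignment["test_a"] else 1 - P_TEST_A[d]
    p *= P_TEST_B[d] if assignment["test_b"] else 1 - P_TEST_B[d]
    return p

def query(variable, evidence):
    """P(variable is True | evidence), summing the joint over unobserved variables."""
    totals = {True: 0.0, False: 0.0}
    for values in product([True, False], repeat=len(VARIABLES)):
        assignment = dict(zip(VARIABLES, values))
        if all(assignment[name] == value for name, value in evidence.items()):
            totals[assignment[variable]] += joint(assignment)
    return totals[True] / (totals[True] + totals[False])

prior = query("disease", {})                                     # no evidence yet
after_a = query("disease", {"test_a": True})                     # test_b is missing
after_both = query("disease", {"test_a": True, "test_b": True})  # belief updated again
predict_b = query("test_b", {"test_a": True})                    # a test node as the query
print(prior, after_a, after_both, predict_b)
```

Missing values need no special handling because unobserved variables are simply summed out, and any node can appear either in the evidence or as the query. Enumeration is exponential in the number of variables, so it is suitable only for very small networks.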

  13. An Example • Novobilski, Andrew, F. Fesmire, D. Sonnemaker. "Mining Bayesian Networks to Forecast Adverse Outcomes Related to Acute Coronary Syndrome." The 17th International FLAIRS Conference, 2004.

  14. Comparing Models – The ROC Curve • The Receiver Operating Characteristic (ROC) Curve • Plots the Percentage of True Positives against the Percentage of False Positives as the Cutoff Value is Varied from Everyone Classified as Ill to Everyone Classified as Healthy. • The Area under the Curve Provides a Consistent Measure of Model Fitness that Varies between 0 and 100.
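
The construction of the curve can be sketched directly from its definition: sweep the cutoff, record the true and false positive rates at each step, and integrate with the trapezoid rule. The scores and labels are invented, and the sketch assumes that a higher score means "more likely ill" and that both classes are present.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs, one per cutoff.

    The sweep starts with nobody classified as ill (0, 0) and ends with
    everyone classified as ill (1, 1).
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for cutoff in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def area_under_curve(points):
    """Trapezoid-rule area under a curve given as ordered (x, y) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Hypothetical model outputs; label 1 = ill, label 0 = healthy.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]

points = roc_points(scores, labels)
auc = area_under_curve(points)
print(round(100 * auc, 1))  # area as a percentage: 8/9, about 88.9
```

An area of 50 corresponds to guessing at random and 100 to perfect separation, which is why the area is convenient for comparing classifiers trained on the same data.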

  15. An Illustration • [Figure: Healthy and Ill groups separated by a Cutoff Value]

  16. Comparing Multiple Classifiers

  17. In Summary … • A Process to Consider … • Collect, Validate, Preprocess, Mine, Compare • Excellent Software is Available • Both Commercial and Open Source • Sample Data Is Available

  18. Thank You! • Questions and/or Comments are Welcome … • Dr. Andy Novobilski • UT Chattanooga Computer Science • 615 McCallie Ave., Dept. 2302 • Chattanooga, TN 37403 • (423) 425-4202 • Andy-Novobilski@utc.edu • http://www.utc.edu/Faculty/Andy-Novobilski
