
The KDD Process for Extracting Useful Knowledge from Volumes of Data

Fayyad, Piatetsky-Shapiro, and Smyth. Ian Kim, SWHIG Seminar.


Presentation Transcript


  1. The KDD Process for Extracting Useful Knowledge from Volumes of Data
     Fayyad, Piatetsky-Shapiro, and Smyth
     Ian Kim, SWHIG Seminar

  2. Overview
     • What can we gain from data?
       • Business and marketing applications
       • Public policy decision-making
       • Scientific research
     • Why do we need the KDD process?
       • Increasing use of data analytics
       • Size of databases involved
       • Being able to access raw data isn’t enough

  3. The KDD Process

  4. Part 1: Selection
     • Formulating the target dataset
       • What kinds of records to consider?
       • Desired fields?
     • Incorporates domain knowledge
       • Background knowledge in relevant field
       • Goals of the dataset
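The selection step can be sketched as a filter over records plus a projection onto the desired fields. This is a minimal illustration, not the method from the paper: the record layout, the field names, and the helper `select_target` are all hypothetical.

```python
# Hypothetical raw records; field names are illustrative, not from the slides.
RAW_RECORDS = [
    {"customer_id": 1, "region": "west", "year": 2021, "amount": 120.0, "notes": "call back"},
    {"customer_id": 2, "region": "east", "year": 2019, "amount": 75.5, "notes": ""},
    {"customer_id": 3, "region": "west", "year": 2022, "amount": 310.0, "notes": "vip"},
]


def select_target(records, keep_record, fields):
    """Keep the records that pass `keep_record`, projected onto `fields`."""
    return [{f: r[f] for f in fields} for r in records if keep_record(r)]


# Domain knowledge enters as the record filter and the list of fields.
target = select_target(
    RAW_RECORDS,
    keep_record=lambda r: r["region"] == "west" and r["year"] >= 2020,
    fields=("customer_id", "amount"),
)
print(target)
```

Keeping the filter and the field list as arguments makes the domain assumptions explicit, so they are easy to revisit if a later step sends the analyst back to selection.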

  5. Part 2: Pre-processing
     • Preparing raw data for transformation
       • Removal of noise, outliers
     • Strategy for handling missing records
       • Missing/unknown value mappings
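As a rough sketch of pre-processing, the code below maps missing-value codes to `None` and then removes outliers. The sentinel codes, the choice to drop missing values, and the outlier rule (median absolute deviation with cutoff `k`) are assumptions chosen for illustration; the slides do not prescribe any of them.

```python
import statistics

# Assumed sentinel codes for "missing"; real datasets define their own.
MISSING_CODES = {"", "NA", "N/A", -999}


def map_missing(value):
    """Map any assumed missing/unknown code to None."""
    return None if value in MISSING_CODES else value


def drop_outliers(values, k=3.0):
    """Keep values within k median absolute deviations of the median."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        # No spread to measure against, so nothing is flagged.
        return list(values)
    return [v for v in values if abs(v - med) <= k * mad]


raw = [10.0, 12.0, "NA", 11.0, -999, 500.0, 9.0]
mapped = [map_missing(v) for v in raw]
# Strategy used here: drop missing values. Imputation is another option.
observed = [v for v in mapped if v is not None]
cleaned = drop_outliers(observed)
print(cleaned)
```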

  6. Part 3: Transformation
     • Data reduction
       • Grouping to reduce the number of variables considered
       • Aggregation to a higher-level row unit
     • Useful representations of data
       • Summary statistics
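Aggregating to a higher-level row unit might look like the sketch below, which rolls transaction rows up to one row per customer and attaches summary statistics. The data and the helper `aggregate_by` are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical transaction-level rows.
transactions = [
    {"customer_id": 1, "amount": 20.0},
    {"customer_id": 1, "amount": 40.0},
    {"customer_id": 2, "amount": 15.0},
]


def aggregate_by(rows, key, value):
    """Collapse rows to one summary per distinct `key`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[value])
    return {
        k: {"count": len(v), "total": sum(v), "mean": mean(v)}
        for k, v in groups.items()
    }


per_customer = aggregate_by(transactions, "customer_id", "amount")
print(per_customer)
```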

  7. Part 4: Data Mining
     • Selection of data model
       • Summarization, classification, clustering, regression analysis
     • Searching for patterns in data
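Of the model families listed, regression is the simplest to show in a few lines. The sketch below fits a one-variable least-squares line by hand; the data are made up, and `fit_line` is an illustrative helper rather than anything taken from the presentation.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx  # assumes the xs are not all identical
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data that lie exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```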

  8. Part 5: Interpretation
     • Interpreting the model used in the previous step
     • Check whether the results make sense
     • Consider different models, returning to prior steps
     • Utilize the obtained results

  9. Challenges of KDD
     • Massive datasets
       • Algorithmic efficiency, approximation, parallel processing
     • Making interaction possible for analysts
       • Develop better tools that allow for human-computer interaction
     • Overfitting, measures of significance
       • Testing on randomly chosen sections
     • Missing or invalid data
       • Strategies to identify hidden variables and dependencies
     • Making data understandable by humans
       • Improved data visualization methods
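"Testing on randomly chosen sections" can be illustrated with a random holdout split. In the sketch below, a deliberately overfit model memorizes its training rows, so it scores perfectly on them and poorly on the held-out rows. The data, the split fraction, and both helpers are assumptions made for the example.

```python
import random


def holdout_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of `rows` and split it into (train, test)."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def mean_squared_error(predict, rows):
    return sum((predict(x) - y) ** 2 for x, y in rows) / len(rows)


def memorizer(train):
    """Overfit on purpose: exact lookup, falling back to the training mean."""
    table = dict(train)
    default = sum(y for _, y in train) / len(train)
    return lambda x: table.get(x, default)


rows = [(x, 2 * x + 1) for x in range(20)]
train, test = holdout_split(rows)
model = memorizer(train)
train_error = mean_squared_error(model, train)
test_error = mean_squared_error(model, test)
print(train_error, test_error)
```

The gap between the two errors is the warning sign: a pattern that holds only on the rows it was found in is unlikely to be real knowledge.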

  10. Challenges of KDD
      • Rapidly changing data
        • Incrementally updating discovered patterns
      • Integration
        • Coordinating database tools (OLAP) and data mining tools
      • Nonstandard data (e.g. multimedia)
        • “Beyond the scope of current KDD technology”
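Incremental updating can be shown on a very simple "pattern", the mean and variance of a stream. The sketch below uses Welford's online algorithm, a technique swapped in for illustration and not named in the slides, so each new value updates the summary without rescanning old data. The class name `RunningStats` is hypothetical.

```python
class RunningStats:
    """Running mean and sample variance via Welford's online algorithm."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance; defined as 0.0 for fewer than two values."""
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0


stats = RunningStats()
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(value)  # one pass; earlier values are never revisited
print(stats.mean, stats.variance)
```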

  11. Conclusion
      • Emerging nature of KDD & data mining fields
      • Human interaction still necessary
      • Incorporating machines to cope with scale of data
      • Improve tools to make better decisions using data
