From Data Mining to Knowledge Discovery: An Introduction

Gregory Piatetsky-Shapiro, KDnuggets

Outline
  • Introduction
  • Data Mining Tasks
  • Application Examples
Trends leading to Data Flood
  • More data is generated:
    • Bank, telecom, other business transactions ...
    • Scientific Data: astronomy, biology, etc
    • Web, text, and e-commerce
  • More data is captured:
    • Storage technology faster and cheaper
    • DBMS capable of handling bigger DB
  • Europe's Very Long Baseline Interferometry (VLBI) has 16 telescopes, each of which produces 1 Gigabit/second of astronomical data over a 25-day observation session
    • storage and analysis a big problem
  • Walmart reported to have a 24-terabyte DB
  • AT&T handles billions of calls per day
    • data cannot be stored -- analysis is done on the fly
Growth Trends
  • Moore’s law
    • Computer Speed doubles every 18 months
  • Storage law
    • total storage doubles every 9 months
  • Consequence
    • very little data will ever be looked at by a human
  • Knowledge Discovery is NEEDED to make sense and use of data.
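
To make the gap between the two laws concrete, here is a minimal sketch of the arithmetic. It assumes clean exponential doubling at the periods quoted above. The function name and the ten-year horizon are illustrative choices, not part of the original slides.

```python
def growth_factor(months: float, doubling_period_months: float) -> float:
    """Factor by which a quantity grows if it doubles every doubling_period_months."""
    return 2.0 ** (months / doubling_period_months)


# Doubling periods are the ones quoted on the slide.
# The 10-year horizon is an illustrative assumption.
HORIZON_MONTHS = 120
speed = growth_factor(HORIZON_MONTHS, 18)   # computer speed
storage = growth_factor(HORIZON_MONTHS, 9)  # total storage

print(f"speed grows   x{speed:,.0f}")
print(f"storage grows x{storage:,.0f}")
print(f"storage outpaces speed by x{storage / speed:,.0f}")
```

Because storage doubles twice in every period in which speed doubles once, the storage factor is the square of the speed factor over any horizon. Under these assumptions the gap after ten years is roughly a hundredfold, which is why so little of the stored data can ever be inspected by hand.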
Knowledge Discovery Definition

Knowledge Discovery in Data is the non-trivial process of identifying

  • valid
  • novel
  • potentially useful
  • and ultimately understandable

patterns in data.

From Advances in Knowledge Discovery and Data Mining, Fayyad, Piatetsky-Shapiro, Smyth, and Uthurusamy (Chapter 1), AAAI/MIT Press, 1996

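Related Fields

[Figure: fields related to Data Mining and Knowledge Discovery. Only the central label survives.]

Knowledge Discovery Process

[Figure: process diagram. Surviving labels: "& Cleaning", "Data Mining", "& Evaluation".]
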
Data Mining Tasks: Classification

Learn a method for predicting the instance class from pre-labeled (classified) instances

Many approaches, including Statistics, Decision Trees, and Neural Networks.


Classification: Linear Regression
  • Linear Regression

w0 + w1 x + w2 y >= 0

  • Regression computes wi from data to minimize squared error to ‘fit’ the data
  • Not flexible enough
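
The idea above can be sketched in a few lines: fit w0 + w1 x + w2 y to targets of +1 and -1 by least squares, then classify by the sign of the result. This is a minimal sketch, not the presenter's code. The toy points, the +1/-1 target encoding, and all function names are assumptions made for illustration.

```python
def solve_linear_system(a, b):
    """Solve a @ w = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_linear_classifier(points, labels):
    """Least-squares fit of w0 + w1*x + w2*y to targets +1 / -1."""
    rows = [(1.0, x, y) for x, y in points]
    # Normal equations: (X^T X) w = X^T t
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xtt = [sum(r[i] * t for r, t in zip(rows, labels)) for i in range(3)]
    return solve_linear_system(xtx, xtt)


def predict(w, x, y):
    """Class +1 if w0 + w1*x + w2*y >= 0, else -1."""
    return 1 if w[0] + w[1] * x + w[2] * y >= 0 else -1


# Toy data, made up for illustration: -1 in the lower left, +1 in the upper right.
points = [(1, 1), (2, 1), (1, 2), (5, 5), (6, 5), (5, 6)]
labels = [-1, -1, -1, 1, 1, 1]
w = fit_linear_classifier(points, labels)
print([predict(w, x, y) for x, y in points])
```

The decision boundary is always a single straight line, which is the sense in which the method is "not flexible enough": it cannot carve out a region such as a box in the middle of the plane.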
Classification: Decision Trees

if X > 5 then blue
else if Y > 3 then blue
else if X > 2 then green
else blue
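
The rules above translate directly into code. The sketch below is one way to write them; the function name and the sample points are made up for illustration.

```python
def classify(x: float, y: float) -> str:
    """Apply the slide's rules in order; the first matching test decides the class."""
    if x > 5:
        return "blue"
    elif y > 3:
        return "blue"
    elif x > 2:
        return "green"
    else:
        return "blue"


# Sample points are made up for illustration.
for point in [(6, 0), (1, 4), (3, 1), (1, 1)]:
    print(point, classify(*point))
```

Read together, the tests mark out a green rectangle (2 < X <= 5 and Y <= 3) surrounded by blue, a region that a single linear boundary cannot describe.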






Classification: Neural Nets
  • Can select more complex regions
  • Can be more accurate
  • Also can overfit the data – find patterns in random noise
Data Mining Central Quest

Find true patterns and avoid overfitting (false patterns due to randomness)
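
A small experiment shows what a false pattern looks like in practice. The sketch below labels random points by coin flip, so there is nothing real to learn, and then "learns" them with a one-nearest-neighbour rule. That rule is a stand-in chosen because it memorizes its training set; it is not a neural net and not a method from the slides. The seed, sample sizes, and names are arbitrary assumptions.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable


def random_dataset(n):
    """Points in the unit square with coin-flip labels: there is no true pattern."""
    return [((rng.random(), rng.random()), rng.choice(["blue", "green"]))
            for _ in range(n)]


def nearest_neighbour_predict(train, point):
    """Predict the label of the closest training point (memorizes the training set)."""
    def squared_distance(item):
        (x, y), _ = item
        return (x - point[0]) ** 2 + (y - point[1]) ** 2
    return min(train, key=squared_distance)[1]


def accuracy(train, data):
    """Fraction of points in data whose label the model predicts correctly."""
    hits = sum(nearest_neighbour_predict(train, p) == label for p, label in data)
    return hits / len(data)


train = random_dataset(200)
test = random_dataset(1000)
train_accuracy = accuracy(train, train)  # each point is its own nearest neighbour
test_accuracy = accuracy(train, test)    # fresh data exposes the lack of a pattern
print(f"train accuracy: {train_accuracy:.2f}")
print(f"test accuracy:  {test_accuracy:.2f}")
```

A perfect score on the training data paired with chance-level performance on fresh data is the signature of overfitting, which is why models are judged on data they have not seen.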

Data Mining Tasks: Clustering

Find “natural” grouping of instances given un-labeled data
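
One common way to find such groupings is k-means, sketched below. The slide does not name an algorithm, so k-means is an assumption made here for illustration, as are the toy points and the choice of starting centroids.

```python
def k_means(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then recompute centroids."""
    assignment = []
    for _ in range(iterations):
        assignment = [
            min(range(len(centroids)),
                key=lambda c: (p[0] - centroids[c][0]) ** 2 + (p[1] - centroids[c][1]) ** 2)
            for p in points
        ]
        new_centroids = []
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                new_centroids.append((sum(p[0] for p in members) / len(members),
                                      sum(p[1] for p in members) / len(members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        centroids = new_centroids
    return assignment, centroids


# Unlabeled toy points, made up for illustration: two visibly separate groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
assignment, centroids = k_means(points, centroids=[points[0], points[-1]])
print(assignment)
print(centroids)
```

No labels are supplied at any point: the grouping comes only from distances between instances. Results depend on the starting centroids and on the number of clusters, both of which have to be chosen by the analyst.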

Major Data Mining Tasks
  • Classification: predicting an item class
  • Clustering: finding clusters in data
  • Associations: e.g. A & B & C occur frequently
  • Visualization: to facilitate human discovery
  • Estimation: predicting a continuous value
  • Deviation Detection: finding changes
  • Link Analysis: finding relationships
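
As an illustration of the Associations task, the sketch below counts how often each combination of items occurs together and keeps the frequent ones. It is a brute-force count rather than an optimized algorithm such as Apriori. The baskets, the support threshold, and the function name are made up for the example.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_size=3):
    """Count every itemset of up to max_size items; keep those seen in >= min_support transactions."""
    counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))
        for size in range(1, max_size + 1):
            counts.update(combinations(items, size))
    return {itemset: n for itemset, n in counts.items() if n >= min_support}


# Made-up market baskets for illustration.
baskets = [
    {"A", "B", "C"},
    {"A", "B", "C", "D"},
    {"A", "B"},
    {"B", "C"},
    {"A", "B", "C"},
]
frequent = frequent_itemsets(baskets, min_support=3)
for itemset, n in sorted(frequent.items()):
    print(" & ".join(itemset), n)
```

Here "A & B & C" appears in three of the five baskets and so counts as frequent, while anything involving "D" is dropped. Enumerating every combination grows quickly with basket size, which is why practical tools prune the search.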
Major Application Areas for Data Mining Solutions
  • Customer Relationship Management (CRM)
  • Database Marketing
  • Fraud Detection
  • Health Care
  • Manufacturing, Process Control
  • Sports and Entertainment
Case Study: Search Engines
  • Early search engines relied mainly on keywords on a page and were subject to manipulation
  • Google's success is due to its algorithm, which relies mainly on links to the page
  • Google founders Sergey Brin and Larry Page were students at Stanford doing research in databases and data mining in 1998, which led to Google
Case Study: Direct Marketing and CRM
  • Most major direct marketing companies are using modeling and data mining
  • Most financial companies are using customer modeling
  • Modeling is easier than changing customer behaviour
  • Some successes
    • Verizon Wireless reduced churn rate from 2% to 1.5%
Biology: Molecular Diagnostics
  • Leukemia: Acute Lymphoblastic (ALL) vs Acute Myeloid (AML)
    • 72 samples, about 7,000 genes



  • Results: 33 correct (97% accuracy), 1 error (sample suspected mislabelled)
  • Outcome predictions?
AF1q: New Marker for Medulloblastoma?
  • AF1q: ALL1-fused gene from chromosome 1q
  • transmembrane protein
  • Related to leukemia (3 PUBMED entries) but not to Medulloblastoma
Case Study: Security and Fraud Detection
  • Credit Card Fraud Detection
  • Money laundering
    • FAIS (US Treasury)
  • Securities Fraud
    • NASDAQ Sonar system
  • Phone fraud
    • AT&T, Bell Atlantic, British Telecom/MCI
  • Bio-terrorism detection at Salt Lake Olympics 2002
Data Mining and Terrorism: Controversy in the News
  • TIA: Terrorism (formerly Total) Information Awareness Program
    • DARPA program closed by Congress
    • some functions transferred to intelligence agencies
  • CAPPS II – screen all airline passengers
    • controversial
  • Invasion of Privacy or Defensive Shield?
Criticism of analytic approach to Threat Detection:

Data Mining will

  • invade privacy
  • generate millions of false positives

But can it be effective?

Can Data Mining and Statistics be Effective for Threat Detection?
  • Criticism: Databases have 5% errors, so analyzing 100 million suspects will generate 5 million false positives
  • Reality: Analytical models correlate many items of information to reduce false positives.
  • Example: Identify one biased coin from 1,000.
    • After one throw of each coin, we cannot identify it
    • After 30 throws, one biased coin will stand out with high probability.
    • Can identify 19 biased coins out of 100 million with a sufficient number of throws
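
The coin example can be checked with exact binomial probabilities. The sketch below assumes the biased coin lands heads 90% of the time and that a coin is flagged when it shows at least 25 heads in 30 throws. Neither figure is in the slides; both are illustrative assumptions, as are the names used.

```python
from math import comb


def tail_probability(n: int, k: int, p: float) -> float:
    """P(at least k heads in n throws) for a coin that lands heads with probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


FAIR_COINS = 999  # from the slide: 1,000 coins, one of them biased
BIASED_P = 0.9    # assumption: the slide does not say how biased the coin is


def screening_summary(throws: int, heads_needed: int):
    """Expected number of fair coins flagged, and chance that the biased coin is flagged."""
    expected_false_positives = FAIR_COINS * tail_probability(throws, heads_needed, 0.5)
    detection_probability = tail_probability(throws, heads_needed, BIASED_P)
    return expected_false_positives, detection_probability


# Flagging rules are assumptions: "heads" after 1 throw, ">= 25 heads" after 30 throws.
for throws, heads_needed in [(1, 1), (30, 25)]:
    false_positives, detection = screening_summary(throws, heads_needed)
    print(f"{throws:>2} throws: expected false positives = {false_positives:.3f}, "
          f"P(biased coin flagged) = {detection:.3f}")
```

With a single throw about half of the fair coins look as suspicious as the biased one. With 30 throws the expected number of falsely flagged fair coins drops below one while the biased coin is still flagged most of the time, which is the sense in which correlating many observations reduces false positives.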
Another Approach: Link Analysis

Can Find Unusual Patterns in the Network Structure

Analytic technology can be effective
  • Combining multiple models and link analysis can reduce false positives
  • Today there are millions of false positives with manual analysis
  • Data Mining is just one additional tool to help analysts
  • Analytic Technology has the potential to reduce the current high rate of false positives
Data Mining with Privacy
  • Data Mining looks for patterns, not people!
  • Technical solutions can limit privacy invasion
    • Replace sensitive personal data with anonymous IDs
    • Give randomized outputs
    • Multi-party computation – distributed data
  • Bayardo & Srikant, Technological Solutions for Protecting Privacy, IEEE Computer, Sep 2003
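
Two of the technical measures listed above can be sketched briefly: replacing an identifier with an anonymous ID, and giving randomized outputs. The specific techniques shown (a keyed hash for the ID, and the classic randomized-response scheme) are assumptions chosen for illustration, not necessarily what the slides or the cited article describe. The key, names, and parameters are hypothetical. Multi-party computation is not covered here.

```python
import hashlib
import hmac
import random

SECRET_KEY = b"replace-with-a-real-secret"  # hypothetical key, kept away from analysts


def anonymous_id(personal_identifier: str) -> str:
    """Replace an identifier with a keyed hash: stable for joins, not readable as a name."""
    digest = hmac.new(SECRET_KEY, personal_identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def randomized_response(truth: bool, rng: random.Random, p_truth: float = 0.75) -> bool:
    """Report the true answer with probability p_truth, otherwise a fair coin flip."""
    if rng.random() < p_truth:
        return truth
    return rng.random() < 0.5


def estimate_true_rate(reports, p_truth: float = 0.75) -> float:
    """Invert the randomization: observed = p_truth * true + (1 - p_truth) * 0.5."""
    observed = sum(reports) / len(reports)
    return (observed - (1 - p_truth) * 0.5) / p_truth


print(anonymous_id("alice@example.com"))

# Simulated population, made up for illustration: 30% have the sensitive attribute.
rng = random.Random(0)
answers = [i % 10 < 3 for i in range(20000)]
reports = [randomized_response(a, rng) for a in answers]
estimate = estimate_true_rate(reports)
print(f"estimated rate: {estimate:.3f}")
```

No single randomized report reveals an individual's true answer with certainty, yet the population-level rate can still be estimated. This matches the point that data mining looks for patterns, not people.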
The Hype Curve for Data Mining and Knowledge Discovery

[Figure: hype curve. Surviving label: "Growing acceptance and mainstreaming".]

Summary

KDnuggets: the website for Data Mining and Knowledge Discovery

Contact: Gregory Piatetsky-Shapiro

Thank You!