1. Data Mining (or KDD)

Definition := “Data Mining is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data” (Fayyad)

Let us find something interesting!


Why Mine Data? Scientific Viewpoint

  • Data collected and stored at enormous speeds (GB/hour)

    • remote sensors on a satellite

    • telescopes scanning the skies

    • microarrays generating gene expression data

    • scientific simulations generating terabytes of data

    • GIS

  • Traditional techniques infeasible for raw data

  • Data mining may help scientists

    • in classifying and segmenting data

    • in hypothesis formation


2.1 Supervised Clustering

Ch. Eick

[Figure: three scatter plots over Attribute1 and Attribute2, with objects labeled class 1, class 2, or unclassified and points marked A through L: a. Unsupervised Clustering, b. Semi-supervised Clustering, c. Supervised Clustering]

Applications of Supervised Clustering Include:

  • Learning Subclasses

  • Region Discovery in Spatial Datasets

  • Distance Function Learning

  • Data Set Compression (reduce size of dataset by using cluster representatives)

  • Adaptive Supervised Clustering
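
Supervised clustering looks for clusters in which a single class dominates. The following is a minimal sketch of how such a clustering might be scored. The slides do not give the fitness function, so the form used here (impurity plus a penalty on the number of clusters), the parameter `beta`, and all function names are illustrative assumptions. Lower values are better.

```python
from collections import Counter
from math import sqrt


def impurity(clusters):
    """Fraction of examples outside the majority class of their cluster.

    `clusters` is a list of non-empty lists of class labels, one list per cluster.
    """
    total = sum(len(labels) for labels in clusters)
    minority = sum(
        len(labels) - Counter(labels).most_common(1)[0][1] for labels in clusters
    )
    return minority / total


def penalty(k, num_classes, n):
    """Assumed penalty: zero until there are more clusters than classes."""
    return sqrt((k - num_classes) / n) if k > num_classes else 0.0


def fitness(clusters, beta=0.1):
    """Hypothetical fitness: impurity plus beta times the cluster-count penalty."""
    n = sum(len(labels) for labels in clusters)
    num_classes = len({label for labels in clusters for label in labels})
    return impurity(clusters) + beta * penalty(len(clusters), num_classes, n)


pure = [["class1"] * 3, ["class2"] * 3]
mixed = [["class1", "class2", "class1"], ["class2", "class1", "class2"]]
print(fitness(pure), fitness(mixed))
```

A representative-based search such as the ones listed on the next slide could use a score of this kind to compare candidate sets of representatives.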


Example: Finding Subclasses

[Figure: plot over Attribute1 and Attribute2 of two classes, Ford and GMC, with the regions Ford Trucks, Ford Vans, Ford SUV, GMC Trucks, GMC Van, and GMC SUV]


SC Algorithms Investigated

  • Representative-based Clustering Algorithms

    • Supervised Partitioning Around Medoids (SPAM)

    • Single Representative Insertion/Deletion Steepest Descent Hill Climbing with Randomized Restart (SRIDHCR)

    • Supervised Clustering using Evolutionary Computing (SCEC)

  • Agglomerative Hierarchical Supervised Clustering (AHSC)

  • Grid-Based Supervised Clustering (GRIDSC)

    • Naïve approach

    • Hierarchical Grid-based Clustering relying on data cubes

    • Grid-based Clustering relying on density estimation techniques


2.2 Spatial Data Mining (SPDM)

  • SPDM := the process of discovering interesting, useful, non-trivial patterns from (large) spatial datasets.

  • Spatial patterns

    • Spatial outlier, discontinuities

      • bad traffic sensors on highways

    • Location prediction models

      • model to identify habitat of endangered species

    • Spatial clusters

      • crime hot-spots, poverty clusters

    • Co-location patterns

      • identify arsenic risk zones in Texas and determine whether the arsenic concentrations of the major Texas aquifers correlate with cultural factors such as population and farm density, and with the geology of the aquifers

Idea: Reuse existing supervised clustering algorithms by running them with a different fitness function that corresponds to a particular measure of interestingness.


Example: Discovery of “Interesting Regions” in Wyoming Census 2000 Datasets



2.3 Distance Function Learning

Example: How to Find Similar Patients?

Task: Construct a distance function that measures patient similarity

Motivation: Finding a “good” distance function is important for:

  • Case based reasoning

  • Clustering

  • Instance-based classification (e.g. nearest neighbor classifiers)

Our Approach: Learn distance functions based on training examples and user feedback


Motivating Example: How To Find Similar Patients?

The following relation is given (with 10000 tuples):

Patient(ssn, weight, height, cancer-sev, eye-color, age,…)

  • Attribute Domains

    • ssn: 9 digits

    • weight: between 30 and 650; μ_weight=158, σ_weight=24.20

    • height: between 0.30 and 2.20 in meters; μ_height=1.52, σ_height=19.2

    • cancer-sev: 4=serious 3=quite_serious 2=medium 1=minor

    • eye-color: {brown, blue, green, grey}

    • age: between 3 and 100; μ_age=45, σ_age=13.2

Task: Define Patient Similarity
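
One way to approach this task is a weighted distance over the mixed attribute types. The sketch below is only an illustration, not the method of the slides: it assumes range normalization for the numeric attributes (using the domains listed above), treats cancer-sev as ordinal and eye-color as nominal, ignores ssn as an identifier, and uses equal default weights. The `Patient` class, `RANGES`, and `patient_distance` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Patient:
    weight: float
    height: float      # meters
    cancer_sev: int    # ordinal, 1 (minor) to 4 (serious)
    eye_color: str     # nominal
    age: float


# Attribute domains taken from the relation description.
RANGES = {"weight": (30, 650), "height": (0.30, 2.20), "age": (3, 100)}


def patient_distance(p, q, weights=None):
    """Weighted average of per-attribute differences, each scaled to [0, 1]."""
    w = weights or {"weight": 1.0, "height": 1.0, "cancer_sev": 1.0,
                    "eye_color": 1.0, "age": 1.0}
    parts = {}
    for name, (lo, hi) in RANGES.items():
        # Numeric attributes: absolute difference divided by the domain width.
        parts[name] = abs(getattr(p, name) - getattr(q, name)) / (hi - lo)
    # Ordinal attribute: rank difference divided by the largest possible gap (4 - 1).
    parts["cancer_sev"] = abs(p.cancer_sev - q.cancer_sev) / 3
    # Nominal attribute: 0 if equal, 1 otherwise.
    parts["eye_color"] = 0.0 if p.eye_color == q.eye_color else 1.0
    return sum(w[name] * parts[name] for name in parts) / sum(w.values())


a = Patient(weight=158, height=1.52, cancer_sev=2, eye_color="brown", age=45)
b = Patient(weight=158, height=1.52, cancer_sev=2, eye_color="blue", age=45)
print(patient_distance(a, b))
```

The weights are the part a learning procedure would adjust: raising the weight of cancer-sev, for example, makes patients with different severities count as less similar.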


Idea: Coevolving Clusters and Distance Functions

[Figure: loop connecting Distance Function Q, Clustering X, the clustering evaluation q(X) (goodness of the distance function Q), and the Weight Updating Scheme / Search Strategy; two example clusterings of "o" and "x" objects contrast a "bad" distance function Q1 with a "good" distance function Q2]
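
The loop can be sketched in a few lines: cluster with the current distance function, score the clustering against the class labels, and keep a weight change only if the score improves. This is a rough sketch under stated assumptions, not the authors' implementation: it uses a weighted Euclidean distance, plain k-means seeded with the first k points, cluster purity as the evaluation, and randomized hill climbing as the search strategy. All function names and parameters are hypothetical.

```python
import random
from collections import Counter


def weighted_dist(a, b, w):
    """Weighted Euclidean distance; w holds one non-negative weight per attribute."""
    return sum(wi * (ai - bi) ** 2 for wi, ai, bi in zip(w, a, b)) ** 0.5


def kmeans(points, k, w, iters=10):
    """Plain k-means under the weighted distance, seeded with the first k points."""
    centroids = [list(p) for p in points[:k]]
    assign = [0] * len(points)
    for _ in range(iters):
        assign = [min(range(k), key=lambda c: weighted_dist(p, centroids[c], w))
                  for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign


def purity(assign, labels):
    """Fraction of objects that belong to the majority class of their cluster."""
    clusters = {}
    for a, y in zip(assign, labels):
        clusters.setdefault(a, []).append(y)
    majority = sum(Counter(ys).most_common(1)[0][1] for ys in clusters.values())
    return majority / len(labels)


def learn_weights(points, labels, k, steps=200, seed=0):
    """Randomized hill climbing: keep a perturbed weight vector only if purity rises."""
    rng = random.Random(seed)
    best_w = [1.0] * len(points[0])
    best_q = purity(kmeans(points, k, best_w), labels)
    for _ in range(steps):
        cand = [max(0.0, wi + rng.gauss(0, 0.3)) for wi in best_w]
        if not any(cand):
            continue  # an all-zero weight vector cannot separate anything
        q = purity(kmeans(points, k, cand), labels)
        if q > best_q:
            best_w, best_q = cand, q
    return best_w, best_q


# Attribute 0 separates the classes; attribute 1 is large-scale noise.
points = [(0.0, 0.0), (0.1, 5.0), (1.0, 0.2), (1.1, 5.1)]
labels = ["o", "o", "x", "x"]
print(learn_weights(points, labels, k=2, steps=50))
```

Because a candidate is accepted only when purity strictly improves, the returned score is never worse than that of the starting weights.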


Distance Function Learning Framework

[Figure: framework diagram relating Distance Function Evaluation to Weight-Updating Scheme / Search Strategy. Labels: K-Means, Supervised Clustering, NN-Classifier, Inside/Outside Weight Updating, Randomized Hill Climbing, Adaptive Clustering, Current Research, Work By Karypis, Other Research, [CHEN05], [ERBV04], [BECV05]]


2.4 Adaptive Data Mining

[Figure: diagram with the labels Clustering Algorithm, Supervised Clustering, Summary, Inputs, Changes, Adaptation System, Evaluation System, Feedback, Domain Expert, Past Experience, Quality, and Fitness Functions (Predefined) q(X)]


2.5 Signatures of Data Sets

Input: a set of classified examples

Output: Signatures in the dataset that characterize

  • how the examples of a class are distributed in the dataset (in relation to the examples of the other classes)

  • how many regions dominated by a single class exist in the dataset

  • which regions dominated by one class border regions dominated by another class

  • where the regions identified in items 2 and 3 are located

  • what the density attractors (maxima of the density function) of the classes in the dataset are

Why are we creating those signatures?

  • As a preprocessing step to develop smarter classifiers

  • To understand why a particular data mining technique works well or does not work well for a particular dataset (meta learning)

Methods employed: density estimation techniques, supervised clustering, proximity graphs (e.g. Delaunay, Gabriel graphs), …
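
As an illustration of how a proximity graph might expose class borders, the sketch below builds a Gabriel graph by brute force and keeps the edges whose endpoints carry different class labels. It uses the standard definition (two points are connected if and only if no other point lies in the closed ball whose diameter is the segment between them). Reading mixed-class edges as a border signal is an assumption made for this example, and the function names are hypothetical.

```python
from itertools import combinations


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def gabriel_edges(points):
    """Brute-force Gabriel graph, O(n^3).

    A point m lies in the closed ball with diameter (i, j) exactly when
    d(i, m)^2 + d(j, m)^2 <= d(i, j)^2, so the edge is kept only if the
    strict inequality holds in the other direction for every other point.
    """
    edges = []
    for i, j in combinations(range(len(points)), 2):
        d_ij = sq_dist(points[i], points[j])
        if all(sq_dist(points[i], points[m]) + sq_dist(points[j], points[m]) > d_ij
               for m in range(len(points)) if m not in (i, j)):
            edges.append((i, j))
    return edges


def border_edges(points, labels):
    """Gabriel edges that connect examples of different classes."""
    return [(i, j) for i, j in gabriel_edges(points) if labels[i] != labels[j]]


points = [(0, 0), (1, 0), (2, 0), (3, 0)]
labels = ["class1", "class1", "class2", "class2"]
print(gabriel_edges(points))         # [(0, 1), (1, 2), (2, 3)]
print(border_edges(points, labels))  # [(1, 2)]
```

In this toy dataset the single mixed-class edge marks the place where the class 1 region meets the class 2 region.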


Example: Signatures of Data Sets

[Figure: three plots over Attribute 1 and Attribute 2]


2.6 Research Christoph F. Eick 2005-2007

  • Clustering for Classification

  • Creating Signatures for Datasets

  • Editing / Data Set Compression

  • Supervised Clustering

  • Distance Function Learning

  • Spatial Data Mining

  • Adaptive Clustering

  • Mining Data Streams

  • Online Data Mining

  • Mining Sensor Data

  • Measures of Interestingness

  • Evolutionary Computing

  • Mining Semi-Structured Data

  • Web Annotation

  • File Prediction


3. UH Data Mining and Machine Learning Group (UH-DMML)

Co-Directors: Christoph F. Eick and Ricardo Vilalta

Goal: Development of data analysis and data mining techniques and the application of these techniques to challenging problems in physics, geology, astronomy, environmental sciences, and medicine.

Topics investigated:

  • Meta Learning

  • Classification and Learning from Examples

  • Clustering

  • Distance Function Learning

  • Using Reinforcement Learning for Data Mining

  • Spatial Data Mining

  • Knowledge Discovery

