1. Data Mining (or KDD)

Definition := “Data Mining is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data” (Fayyad)

Let us find something interesting!


Why Mine Data? Scientific Viewpoint

  • Data collected and stored at enormous speeds (GB/hour)

    • remote sensors on a satellite

    • telescopes scanning the skies

    • microarrays generating gene expression data

    • scientific simulations generating terabytes of data

    • GIS

  • Traditional techniques infeasible for raw data

  • Data mining may help scientists

    • in classifying and segmenting data

    • in Hypothesis Formation


2.1 Supervised Clustering

Ch. Eick

[Figure: three scatter plots over Attribute1 and Attribute2, with legend entries class 1, class 2, and unclassified object, and cluster labels A through L. Panels: a. Unsupervised Clustering, b. Semi-supervised Clustering, c. Supervised Clustering]

Applications of Supervised Clustering Include:

  • Learning Subclasses

  • Region Discovery in Spatial Datasets

  • Distance Function Learning

  • Data Set Compression (reduce size of dataset by using cluster representatives)

  • Adaptive Supervised Clustering
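
As a rough illustration of what a supervised clustering algorithm might optimize, the sketch below scores a clustering by class purity and subtracts a penalty for using many clusters. This is an assumption made for illustration: the slides do not define the fitness function, and the names `purity`, `penalized_fitness`, and the parameter `beta` are hypothetical.

```python
from collections import Counter


def purity(clusters):
    """Fraction of examples that carry the majority class of their cluster.

    `clusters` is a list of clusters; each cluster is a list of class labels.
    Assumes at least one non-empty cluster.
    """
    total = sum(len(labels) for labels in clusters)
    majority = sum(
        Counter(labels).most_common(1)[0][1] for labels in clusters if labels
    )
    return majority / total


def penalized_fitness(clusters, beta=0.1):
    """Hypothetical trade-off: reward purity, penalize the number of clusters."""
    total = sum(len(labels) for labels in clusters)
    return purity(clusters) - beta * len(clusters) / total


if __name__ == "__main__":
    example = [["class 1", "class 1", "class 2"], ["class 2", "class 2"]]
    print(purity(example))
    print(penalized_fitness(example))
```

Without the penalty term, placing every example in its own cluster would trivially reach a purity of 1, which is why some cost on the number of clusters is assumed here.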


Example: Finding Subclasses

[Figure: scatter plot over Attribute1 and Attribute2 with two classes (Ford, GMC) and clusters labeled Ford Trucks, Ford Vans, Ford SUV, GMC Trucks, GMC Van, GMC SUV]


SC Algorithms Investigated

  • Representative-based Clustering Algorithms

    • Supervised Partitioning Around Medoids (SPAM).

    • Single Representative Insertion/Deletion Steepest Descent Hill Climbing with Randomized Restart (SRIDHCR).

    • Supervised Clustering using Evolutionary Computing (SCEC)

  • Agglomerative Hierarchical Supervised Clustering (AHSC)

  • Grid-Based Supervised Clustering (GRIDSC)

    • Naïve approach

    • Hierarchical Grid-based Clustering relying on data cubes

    • Grid-based Clustering relying on density estimation techniques


2.2 Spatial Data Mining (SPDM)

  • SPDM := the process of discovering interesting, useful, non-trivial patterns from (large) spatial datasets.

  • Spatial patterns

    • Spatial outlier, discontinuities

      • bad traffic sensors on highways

    • Location prediction models

      • model to identify habitat of endangered species

    • Spatial clusters

      • crime hot-spots, poverty clusters

    • Co-location patterns

      • identify arsenic risk zones in Texas and determine whether the arsenic concentrations of the major Texas aquifers correlate with cultural factors such as population, farm density, and the geology of the aquifers

Idea: Reuse the supervised clustering algorithms that already exist by running them with a different fitness function that corresponds to a particular measure of interestingness.
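
The idea of swapping the fitness function can be sketched as follows. The search here is deliberately trivial (pick the best of a few candidate clusterings) and stands in for any of the supervised clustering algorithms; the interestingness measure `hotspot_interestingness` is a hypothetical example, not one taken from the slides.

```python
def mean(values):
    return sum(values) / len(values)


def hotspot_interestingness(regions, global_mean):
    """Hypothetical measure: size-weighted absolute deviation of each
    region's mean from the global mean of the variable of interest.

    `regions` is a list of regions; each region is a list of measurements
    (for example, one concentration value per object in the region).
    """
    n = sum(len(region) for region in regions)
    return sum(
        len(region) * abs(mean(region) - global_mean)
        for region in regions
        if region
    ) / n


def pick_best(candidates, fitness):
    """Stand-in for a clustering search: the fitness function is a parameter,
    so the same search can be reused with a different measure."""
    return max(candidates, key=fitness)


if __name__ == "__main__":
    separated = [[1.0, 1.0], [9.0, 9.0]]  # regions with very different means
    mixed = [[1.0, 9.0], [1.0, 9.0]]      # regions that look like the average
    best = pick_best(
        [separated, mixed],
        lambda regions: hotspot_interestingness(regions, global_mean=5.0),
    )
    print(best)
```

Passing a class-purity function instead of the lambda would turn the same search back into ordinary supervised clustering, which is the reuse the slide describes.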


Example: Discovery of “Interesting Regions” in Wyoming Census 2000 Datasets


2.3 Distance Function Learning

Example: How to Find Similar Patients?

Task: Construct a distance function that measures patient similarity

Motivation: Finding a “good” distance function is important for:

  • Case based reasoning

  • Clustering

  • Instance-based classification (e.g. nearest neighbor classifiers)

Our Approach: Learn distance functions based on training examples and user feedback


Motivating Example: How To Find Similar Patients?

The following relation is given (with 10000 tuples):

Patient(ssn, weight, height, cancer-sev, eye-color, age,…)

  • Attribute Domains

    • ssn: 9 digits

    • weight: between 30 and 650; μ_weight = 158, σ_weight = 24.20

    • height: between 0.30 and 2.20 in meters; μ_height = 1.52, σ_height = 19.2

    • cancer-sev: 4=serious 3=quite_serious 2=medium 1=minor

    • eye-color: {brown, blue, green, grey}

    • age: between 3 and 100; μ_age = 45, σ_age = 13.2

Task: Define Patient Similarity
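
One possible starting point for the task is a weighted sum of per-attribute distances, each scaled to [0, 1]. The sketch below is an assumption, not the approach of the slides: it normalizes numeric attributes by their stated ranges, treats cancer-sev as ordinal and eye-color as nominal, and leaves out ssn because it is an identifier. The function names and the weights are hypothetical.

```python
# Attribute ranges taken from the attribute domains listed above.
RANGES = {"weight": (30, 650), "height": (0.30, 2.20), "age": (3, 100)}
CANCER_SEV_LEVELS = 4  # ordinal scale 1..4


def attribute_distances(p, q):
    """Per-attribute distances between two patients, each in [0, 1]."""
    d = {}
    for name, (lo, hi) in RANGES.items():
        d[name] = abs(p[name] - q[name]) / (hi - lo)
    # Ordinal attribute: distance grows with the number of levels apart.
    d["cancer-sev"] = abs(p["cancer-sev"] - q["cancer-sev"]) / (CANCER_SEV_LEVELS - 1)
    # Nominal attribute: equal or not equal.
    d["eye-color"] = 0.0 if p["eye-color"] == q["eye-color"] else 1.0
    return d


def patient_distance(p, q, weights):
    """Weighted average of the per-attribute distances.

    `weights` maps attribute names to non-negative weights (not all zero).
    """
    d = attribute_distances(p, q)
    return sum(weights[a] * d[a] for a in weights) / sum(weights.values())


if __name__ == "__main__":
    a = {"weight": 100, "height": 1.50, "age": 40, "cancer-sev": 1, "eye-color": "brown"}
    b = {"weight": 162, "height": 1.69, "age": 40, "cancer-sev": 4, "eye-color": "blue"}
    equal = {"weight": 1, "height": 1, "age": 1, "cancer-sev": 1, "eye-color": 1}
    print(patient_distance(a, b, equal))
```

With equal weights, eye-color counts as much as cancer-sev, which is rarely what a domain expert wants; choosing the weights is exactly what a learned distance function would have to do.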


Idea: Coevolving Clusters and Distance Functions

[Figure: feedback loop linking Distance Function Q, Clustering X, Clustering Evaluation q(X), Goodness of the Distance Function Q, and Weight Updating Scheme / Search Strategy. Two panels show clusters of o and x examples under a “bad” distance function Q1 and a “good” distance function Q2]


Distance Function Learning Framework

[Figure: framework diagram with two parts. Distance Function Evaluation: K-Means, Supervised Clustering, NN-Classifier. Weight-Updating Scheme / Search Strategy: Inside/Outside Weight Updating, Randomized Hill Climbing, Adaptive Clustering. Annotations in the diagram: Current Research, [CHEN05], [ERBV04], Work by Karypis, Other Research, [BECV05]]
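
To make one pairing from the framework concrete, the sketch below combines randomized hill climbing (as the search strategy) with a nearest-neighbor classifier (as the evaluation of a distance function). It is a minimal sketch under stated assumptions: the distance is a weighted Manhattan distance over numeric attributes, the evaluation is leave-one-out 1-NN accuracy, and all function names and parameters are hypothetical.

```python
import random


def weighted_distance(a, b, weights):
    """Weighted Manhattan distance between two numeric vectors."""
    return sum(w * abs(x - y) for w, x, y in zip(weights, a, b))


def loo_nn_accuracy(points, labels, weights):
    """Leave-one-out accuracy of a 1-nearest-neighbor classifier."""
    correct = 0
    for i, p in enumerate(points):
        nearest = min(
            (j for j in range(len(points)) if j != i),
            key=lambda j: weighted_distance(p, points[j], weights),
        )
        correct += labels[nearest] == labels[i]
    return correct / len(points)


def hill_climb_weights(points, labels, steps=200, step_size=0.2, seed=0):
    """Randomly perturb the weights; keep a candidate only if it scores better."""
    rng = random.Random(seed)
    weights = [1.0] * len(points[0])
    best = loo_nn_accuracy(points, labels, weights)
    for _ in range(steps):
        candidate = [
            max(0.0, w + rng.uniform(-step_size, step_size)) for w in weights
        ]
        score = loo_nn_accuracy(points, labels, candidate)
        if score > best:
            weights, best = candidate, score
    return weights, best


if __name__ == "__main__":
    # The first attribute separates the classes; the second is noise.
    points = [(0.0, 0.0), (0.1, 1.0), (1.0, 0.05), (1.1, 0.95)]
    labels = ["a", "a", "b", "b"]
    print(loo_nn_accuracy(points, labels, [1.0, 1.0]))
    print(hill_climb_weights(points, labels))
```

Replacing `loo_nn_accuracy` with a clustering-based evaluation would give a different cell of the framework while leaving the search loop unchanged.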


2.4 Adaptive Data Mining

[Figure: system diagram with the components Clustering Algorithm, Supervised Clustering, Summary, Inputs, Changes, Adaptation System, Evaluation System, Feedback, Domain Expert, Past Experience, Quality, and Fitness Functions (Predefined) q(X)]


2.5 Signatures of Data Sets

Input: a set of classified examples

Output: Signatures in the dataset that characterize

  • how the examples of a class are distributed (in relation to the examples of the other classes) in the dataset

  • how many regions dominated by a single class exist in the dataset

  • which regions dominated by one class border regions dominated by another class

  • where the regions identified in items 2 and 3 are located

  • what the density attractors (maxima of the density function) of the classes in the dataset are

Why are we creating those signatures?

  • As a preprocessing step to develop smarter classifiers

  • To understand why a particular data mining technique works well or does not work well for a particular dataset → meta learning

Methods employed: density estimation techniques, supervised clustering, proximity graphs (e.g. Delaunay, Gabriel graphs), …
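
As an illustration of one signature, the density attractors of a class, the sketch below estimates a one-dimensional density with a Gaussian kernel and reports the local maxima found on a regular grid. The one-dimensional setting, the kernel, the grid search, and the names `kde` and `density_attractors` are all assumptions made for this example; the slides do not specify a method.

```python
import math


def kde(x, samples, bandwidth):
    """Gaussian kernel density estimate at point x."""
    norm = len(samples) * bandwidth * math.sqrt(2 * math.pi)
    return sum(
        math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
    ) / norm


def density_attractors(samples, bandwidth, grid_steps=200):
    """Grid points where the estimated density has a local maximum."""
    lo, hi = min(samples), max(samples)
    grid = [lo + (hi - lo) * i / grid_steps for i in range(grid_steps + 1)]
    dens = [kde(x, samples, bandwidth) for x in grid]
    return [
        grid[i]
        for i in range(1, grid_steps)
        if dens[i] > dens[i - 1] and dens[i] >= dens[i + 1]
    ]


if __name__ == "__main__":
    # Values of one attribute for the examples of a single class.
    class_1 = [0.9, 1.0, 1.1, 4.9, 5.0, 5.1]
    print(density_attractors(class_1, bandwidth=0.3))
```

Running this once per class and comparing where the attractors fall gives a rough picture of how many dense regions each class forms; the result depends strongly on the chosen bandwidth.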


Example: Signatures of Data Sets

[Figure: three scatter plots over Attribute1 and Attribute2, with legend entries class 1, class 2, and unclassified object, and cluster labels A through L. Panels: a. Unsupervised Clustering, b. Semi-supervised Clustering, c. Supervised Clustering]


Applications of Creating Signatures:

  • Class Decomposition (see also [VAE03])

[Figure: three scatter plots over Attribute 1 and Attribute 2]


2.6 Research Christoph F. Eick 2005-2007

[Figure: diagram of research topics: Clustering for Classification; Creating Signatures for Datasets; Editing / Data Set Compression; Supervised Clustering; Distance Function Learning; Spatial Data Mining; Adaptive Clustering; Mining Data Streams; Online Data Mining; Mining Sensor Data; Measures of Interestingness; Evolutionary Computing; Mining Semi-Structured Data; Web Annotation; File Prediction]


3. UH Data Mining and Machine Learning Group (UH-DMML)

Co-Directors: Christoph F. Eick and Ricardo Vilalta

Goal: Development of data analysis and data mining techniques and the application of these techniques to challenging problems in physics, geology, astronomy, environmental sciences, and medicine.

Topics investigated:

  • Meta Learning

  • Classification and Learning from Examples

  • Clustering

  • Distance Function Learning

  • Using Reinforcement Learning for Data Mining

  • Spatial Data Mining

  • Knowledge Discovery
