1. Data Mining (or KDD)


Definition := “Data Mining is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data” (Fayyad)

Let us find something interesting!

Why Mine Data? Scientific Viewpoint
  • Data collected and stored at enormous speeds (GB/hour)
    • remote sensors on a satellite
    • telescopes scanning the skies
    • microarrays generating gene expression data
    • scientific simulations generating terabytes of data
    • GIS
  • Traditional techniques infeasible for raw data
  • Data mining may help scientists
    • in classifying and segmenting data
    • in Hypothesis Formation
2.1 Supervised Clustering

Ch. Eick

[Figure: three scatter plots of Attribute2 against Attribute1 showing class 1, class 2, and unclassified objects, with cluster labels A to L appearing across the panels. Panels: a. Unsupervised Clustering; b. Semi-supervised Clustering; c. Supervised Clustering.]

Applications of Supervised Clustering Include:

  • Learning Subclasses
  • Region Discovery in Spatial Datasets
  • Distance Function Learning
  • Data Set Compression (reduce size of dataset by using cluster representatives)
  • Adaptive Supervised Clustering
Example: Finding Subclasses

[Figure: scatter plot of Attribute2 against Attribute1 with a legend for Ford and GMC objects. Cluster labels: Ford Trucks, Ford Vans, Ford SUV, GMC Trucks, GMC Van, GMC SUV.]

SC Algorithms Investigated
  • Representative-based Clustering Algorithms
    • Supervised Partitioning Around Medoids (SPAM).
    • Single Representative Insertion/Deletion Steepest Descent Hill Climbing with Randomized Restart (SRIDHCR).
    • Supervised Clustering using Evolutionary Computing (SCEC)
  • Agglomerative Hierarchical Supervised Clustering (AHSC)
  • Grid-Based Supervised Clustering (GRIDSC)
    • Naïve approach
    • Hierarchical Grid-based Clustering relying on data cubes
    • Grid-based Clustering relying on density estimation techniques
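
The representative-based algorithms listed above share one pattern: choose a set of representatives, assign every object to its closest representative, and score the resulting clustering with a fitness function. The Python sketch below illustrates that pattern with a greedy single-representative insertion/deletion hill climber with random restarts, in the spirit of SRIDHCR. It is not the authors' implementation. The fitness function (class impurity plus a penalty `beta * sqrt(k / n)` on the number of clusters k), the parameter `beta`, and all function names are assumptions made for illustration.

```python
import math
import random
from collections import Counter

def assign(points, reps):
    """Return, for every point, the index of its closest representative."""
    return [min(reps, key=lambda r: math.dist(p, points[r])) for p in points]

def fitness(points, labels, reps, beta=0.1):
    """Assumed fitness (lower is better): impurity plus a cluster-count penalty."""
    owner = assign(points, reps)
    minority = 0
    for r in reps:
        members = [labels[i] for i, o in enumerate(owner) if o == r]
        if members:
            # Objects that do not belong to the cluster's majority class.
            minority += len(members) - Counter(members).most_common(1)[0][1]
    n = len(points)
    return minority / n + beta * math.sqrt(len(reps) / n)

def hill_climb(points, labels, beta=0.1, restarts=5, seed=0):
    """Greedy insertion/deletion of single representatives with random restarts."""
    rng = random.Random(seed)
    n = len(points)
    best_reps, best_q = None, float("inf")
    for _ in range(restarts):
        reps = rng.sample(range(n), min(2, n))
        q = fitness(points, labels, reps, beta)
        while True:
            # Neighbourhood: add one non-representative or drop one representative.
            neighbours = [reps + [i] for i in range(n) if i not in reps]
            if len(reps) > 1:
                neighbours += [[r for r in reps if r != d] for d in reps]
            if not neighbours:
                break
            scored = [(fitness(points, labels, c, beta), c) for c in neighbours]
            cand_q, cand = min(scored, key=lambda t: t[0])
            if cand_q >= q:
                break  # no neighbour improves the fitness: local optimum
            reps, q = cand, cand_q
        if q < best_q:
            best_reps, best_q = reps, q
    return sorted(best_reps), best_q

if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    lbl = ["a", "a", "a", "b", "b", "b"]
    print(hill_climb(pts, lbl))
```

Each step evaluates every neighbour, so the sketch only suits small datasets. Larger values of `beta` favour fewer, less pure clusters.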
2.2 Spatial Data Mining (SPDM)
  • SPDM := the process of discovering interesting, useful, non-trivial patterns from (large) spatial datasets.
  • Spatial patterns
    • Spatial outlier, discontinuities
      • bad traffic sensors on highways
    • Location prediction models
      • model to identify habitat of endangered species
    • Spatial clusters
      • crime hot-spots, poverty clusters
    • Co-location patterns
      • identify arsenic risk zones in Texas and determine whether the arsenic concentrations of the major Texas aquifers correlate with cultural factors such as population, farm density, and the geology of the aquifers

Idea: Reuse the supervised clustering algorithms that already exist by running them with a different fitness function that corresponds to a particular measure of interestingness.
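
One way to read this idea is that the fitness function becomes a parameter of the clustering search. The Python sketch below shows a clustering evaluator that accepts any per-cluster interestingness measure. The function names, the dictionary keys `label` and `value`, and the threshold-based hot-spot measure are hypothetical and are not taken from the slides.

```python
from collections import Counter
from statistics import mean

def evaluate(clusters, interestingness):
    """Size-weighted average interestingness of non-empty clusters (higher is better)."""
    total = sum(len(c) for c in clusters)
    return sum(len(c) * interestingness(c) for c in clusters) / total

def purity(cluster):
    """Supervised clustering view: share of the majority class in the cluster."""
    counts = Counter(obj["label"] for obj in cluster)
    return counts.most_common(1)[0][1] / len(cluster)

def hot_spot(threshold):
    """Spatial view (assumed measure): reward clusters whose mean value exceeds a threshold."""
    def measure(cluster):
        return max(0.0, mean(obj["value"] for obj in cluster) - threshold)
    return measure

if __name__ == "__main__":
    clusters = [
        [{"label": "high", "value": 12.0}, {"label": "high", "value": 8.0}],
        [{"label": "low", "value": 1.0}, {"label": "high", "value": 3.0}],
    ]
    print(evaluate(clusters, purity))         # class-purity fitness
    print(evaluate(clusters, hot_spot(5.0)))  # hot-spot fitness on the same clustering
```

The search procedure only ever calls `evaluate`, so swapping `purity` for `hot_spot(...)` changes what counts as a good clustering without touching the algorithm.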

2.3 Distance Function Learning

Example: How to Find Similar Patients?

Task: Construct a distance function that measures patient similarity

Motivation: Finding a “good” distance function is important for:

  • Case based reasoning
  • Clustering
  • Instance-based classification (e.g. nearest neighbor classifiers)

Our Approach: Learn distance functions based on training examples and user feedback


Motivating Example: How To Find Similar Patients?

The following relation is given (with 10000 tuples):

Patient(ssn, weight, height, cancer-sev, eye-color, age,…)

  • Attribute Domains
    • ssn: 9 digits
    • weight: between 30 and 650; μ_weight = 158, σ_weight = 24.20
    • height: between 0.30 and 2.20 in meters; μ_height = 1.52, σ_height = 19.2
    • cancer-sev: 4=serious 3=quite_serious 2=medium 1=minor
    • eye-color: {brown, blue, green, grey }
    • age: between 3 and 100; μ_age = 45, σ_age = 13.2

Task: Define Patient Similarity
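
A minimal sketch of one possible patient distance function follows, written in Python. It makes several assumptions that the slides do not state: `ssn` is ignored because it is an identifier, numeric and ordinal attributes are normalized by the width of the domains listed above, `eye-color` contributes 0 for a match and 1 for a mismatch, and the attribute weights in `WEIGHTS` are placeholder values of the kind a learning procedure would tune.

```python
from dataclasses import dataclass

@dataclass
class Patient:
    ssn: str
    weight: float
    height: float
    cancer_sev: int
    eye_color: str
    age: float

# Attribute domains as given for the Patient relation.
RANGES = {"weight": (30, 650), "height": (0.30, 2.20),
          "cancer_sev": (1, 4), "age": (3, 100)}

# Hypothetical attribute weights; ssn is deliberately left out.
WEIGHTS = {"weight": 1.0, "height": 1.0, "cancer_sev": 2.0,
           "eye_color": 0.5, "age": 1.0}

def attribute_distance(name, a, b):
    """Per-attribute distance in [0, 1]."""
    if name == "eye_color":          # nominal: match / mismatch
        return 0.0 if a == b else 1.0
    lo, hi = RANGES[name]            # numeric or ordinal: range-normalized difference
    return abs(a - b) / (hi - lo)

def patient_distance(p, q, weights=WEIGHTS):
    """Weighted average of the per-attribute distances, in [0, 1]."""
    total = sum(weights.values())
    return sum(w * attribute_distance(name, getattr(p, name), getattr(q, name))
               for name, w in weights.items()) / total

if __name__ == "__main__":
    a = Patient("123456789", 160, 1.70, 2, "brown", 40)
    b = Patient("987654321", 180, 1.65, 3, "blue", 55)
    print(patient_distance(a, b))
```

Treating `cancer-sev` as a number assumes its four levels are evenly spaced, which is itself a modeling choice.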

Idea: Coevolving Clusters and Distance Functions

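[Figure: feedback loop. A distance function Q is used to cluster the data; the resulting clustering X is scored by a clustering evaluation q(X), which serves as the goodness of the distance function Q; a weight updating scheme / search strategy then adjusts Q. Two sketches of o and x objects contrast a “bad” distance function Q1 with a “good” distance function Q2.]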

Distance Function Learning Framework

[Figure: framework diagram with two components, Distance Function Evaluation and Weight-Updating Scheme / Search Strategy. Boxes: K-Means, Supervised Clustering, NN-Classifier, Adaptive Clustering, Inside/Outside Weight Updating, Randomized Hill Climbing, Work by Karypis. Annotations: Current Research, Other Research; citations [CHEN05], [ERBV04], [BECV05].]
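
The framework pairs a way of evaluating a distance function with a strategy for searching the space of attribute weights. As an illustration, the Python sketch below combines two of the labels that appear in the diagram, a nearest-neighbor classifier as evaluator and randomized hill climbing as search strategy. This particular pairing, the leave-one-out protocol, the weighted Manhattan distance, and the parameters `steps` and `step_size` are assumptions, not the method of the cited work.

```python
import random

def weighted_distance(a, b, weights):
    """Weighted Manhattan distance between two numeric vectors."""
    return sum(w * abs(x - y) for w, x, y in zip(weights, a, b))

def loo_accuracy(X, y, weights):
    """Leave-one-out accuracy of a 1-nearest-neighbor classifier."""
    hits = 0
    for i, xi in enumerate(X):
        nearest = min((j for j in range(len(X)) if j != i),
                      key=lambda j: weighted_distance(xi, X[j], weights))
        hits += y[nearest] == y[i]
    return hits / len(X)

def learn_weights(X, y, steps=200, step_size=0.2, seed=0):
    """Randomized hill climbing: perturb the weights, keep the change if it is not worse."""
    rng = random.Random(seed)
    weights = [1.0] * len(X[0])
    best = loo_accuracy(X, y, weights)
    for _ in range(steps):
        candidate = [max(0.0, w + rng.uniform(-step_size, step_size)) for w in weights]
        if sum(candidate) == 0.0:
            continue  # an all-zero weight vector makes every distance zero
        score = loo_accuracy(X, y, candidate)
        if score >= best:  # ties are accepted so the search can cross plateaus
            weights, best = candidate, score
    return weights, best

if __name__ == "__main__":
    # Attribute 0 separates the classes; attribute 1 is large-scale noise.
    X = [(0.0, 5.0), (0.1, -5.0), (1.0, 5.1), (1.1, -5.1)]
    y = [0, 0, 1, 1]
    print(learn_weights(X, y))
```

The accuracy returned can never fall below that of the initial equal weights, because a candidate is only accepted when its score is at least as good as the current best.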

2.4 Adaptive Data Mining

[Figure: adaptive clustering loop. Labels: Clustering, Supervised Clustering Algorithm, Summary, Inputs, Changes, Adaptation System, Evaluation System, Feedback, Domain Expert, Past Experience, Quality q(X), Fitness Functions (Predefined).]

2.5 Signatures of Data Sets

Input: a set of classified examples

Output: Signatures in the dataset that characterize

    • how the examples of a class are distributed in relation to the examples of the other classes
    • how many regions dominated by a single class exist in the data set
    • which regions dominated by one class border regions dominated by another class
    • where the regions identified in the previous two items are located
    • what the density attractors (maxima of the density function) of the classes in the data set are

Why are we creating those signatures?

  • As a preprocessing step to develop smarter classifiers
  • To understand why a particular data mining technique works well or poorly for a particular dataset (meta learning)

Methods employed: density estimation techniques, supervised clustering, proximity graphs (e.g. Delaunay, Gabriel graphs),…
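
As an illustration of how a proximity graph can expose class borders, the Python sketch below builds a Gabriel graph by brute force and keeps the edges whose endpoints carry different class labels. Two points are joined when no third point lies strictly inside the ball that has the two points as its diameter. The function names and the use of mixed-class edges as a border signature are assumptions made for this example.

```python
from itertools import combinations

def sq_dist(p, q):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def gabriel_edges(points):
    """Pairs (i, j) such that no other point lies strictly inside their diameter ball."""
    edges = []
    for i, j in combinations(range(len(points)), 2):
        d_ij = sq_dist(points[i], points[j])
        # Point k is strictly inside the ball iff d(i,k)^2 + d(j,k)^2 < d(i,j)^2.
        if all(sq_dist(points[i], points[k]) + sq_dist(points[j], points[k]) >= d_ij
               for k in range(len(points)) if k not in (i, j)):
            edges.append((i, j))
    return edges

def border_edges(points, labels):
    """Gabriel edges that connect examples of different classes."""
    return [(i, j) for i, j in gabriel_edges(points) if labels[i] != labels[j]]

if __name__ == "__main__":
    pts = [(0, 0), (1, 0), (2, 0), (3, 0)]
    lbl = ["a", "a", "b", "b"]
    print(gabriel_edges(pts))
    print(border_edges(pts, lbl))
```

The brute-force construction takes cubic time in the number of points, so it is only meant for small examples.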

Example: Signatures of Data Sets

[Figure: three scatter plots of Attribute2 against Attribute1 showing class 1, class 2, and unclassified objects, with cluster labels A to L appearing across the panels. Panels: a. Unsupervised Clustering; b. Semi-supervised Clustering; c. Supervised Clustering.]


Applications of Creating Signatures:

    • Class Decomposition (see also [VAE03])

[Figure: three scatter plots with axes Attribute 1 and Attribute 2.]

2.6 Research Christoph F. Eick 2005-2007

  • Clustering for Classification
  • Creating Signatures for Datasets
  • Editing / Data Set Compression
  • Supervised Clustering
  • Distance Function Learning
  • Spatial Data Mining
  • Adaptive Clustering
  • Mining Data Streams
  • Online Data Mining
  • Mining Sensor Data
  • Measures of Interestingness
  • Evolutionary Computing
  • Mining Semi-Structured Data
  • Web Annotation
  • File Prediction


3. UH Data Mining and Machine Learning Group (UH-DMML)

Co-Directors: Christoph F. Eick and Ricardo Vilalta

Goal: Development of data analysis and data mining techniques and the application of these techniques to challenging problems in physics, geology, astronomy, environmental sciences, and medicine.

Topics investigated:

  • Meta Learning
  • Classification and Learning from Examples
  • Clustering
  • Distance Function Learning
  • Using Reinforcement Learning for Data Mining
  • Spatial Data Mining
  • Knowledge Discovery