- By **akira**

## PowerPoint Slideshow about '1. Data Mining (or KDD)' - akira


Presentation Transcript

1. Data Mining (or KDD)

Definition := “Data Mining is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data” (Fayyad)

Let us find something interesting!

Why Mine Data? Scientific Viewpoint

- Data collected and stored at enormous speeds (GB/hour)
  - remote sensors on a satellite
  - telescopes scanning the skies
  - microarrays generating gene expression data
  - scientific simulations generating terabytes of data
  - GIS
- Traditional techniques infeasible for raw data
- Data mining may help scientists
  - in classifying and segmenting data
  - in hypothesis formation

2.1 Supervised Clustering

Ch. Eick

[Figure: three scatter plots over Attribute1 and Attribute2 showing class 1, class 2, and unclassified objects, with labels A to L. Panels: a. Unsupervised Clustering, b. Semi-supervised Clustering, c. Supervised Clustering]

Applications of Supervised Clustering Include:

- Learning Subclasses
- Region Discovery in Spatial Datasets
- Distance Function Learning
- Data Set Compression (reduce size of dataset by using cluster representatives)
- Adaptive Supervised Clustering

Example: Finding Subclasses

[Figure: scatter plot over Attribute1 and Attribute2 with two classes (Ford, GMC) and the subclass labels Ford Trucks, Ford Vans, Ford SUV, GMC Trucks, GMC Van, GMC SUV]

SC Algorithms Investigated

- Representative-based Clustering Algorithms
  - Supervised Partitioning Around Medoids (SPAM)
  - Single Representative Insertion/Deletion Steepest Descent Hill Climbing with Randomized Restart (SRIDHCR)
  - Supervised Clustering using Evolutionary Computing (SCEC)
- Agglomerative Hierarchical Supervised Clustering (AHSC)
- Grid-Based Supervised Clustering (GRIDSC)
  - Naïve approach
  - Hierarchical Grid-based Clustering relying on data cubes
  - Grid-based Clustering relying on density estimation techniques
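As a rough illustration of the representative-based family, the sketch below follows the idea spelled out in the SRIDHCR name: start from a random set of representatives, try every single insertion or deletion, move to the best improving neighbor, and restart from several random starting sets. The fitness function (class impurity plus a penalty on the number of clusters, weighted by `beta`), the Euclidean distance, and all function names are assumptions made for this example; the slides do not specify them.

```python
import math
import random
from collections import Counter


def assign(points, reps):
    """Assign each point index to the index of its closest representative."""
    return [min(reps, key=lambda r: math.dist(points[i], points[r]))
            for i in range(len(points))]


def fitness(points, labels, reps, beta=0.1):
    """Assumed fitness (lower is better): impurity + beta * penalty(k)."""
    owner = assign(points, reps)
    minority = 0
    for r in reps:
        members = [labels[i] for i in range(len(points)) if owner[i] == r]
        if members:
            # Examples that do not belong to the cluster's majority class.
            minority += len(members) - Counter(members).most_common(1)[0][1]
    n, k, c = len(points), len(reps), len(set(labels))
    penalty = math.sqrt((k - c) / n) if k > c else 0.0
    return minority / n + beta * penalty


def hill_climb(points, labels, start, beta=0.1):
    """Steepest descent over single-representative insertions and deletions."""
    current = set(start)
    best = fitness(points, labels, current, beta)
    while True:
        neighbors = [current | {i} for i in range(len(points)) if i not in current]
        if len(current) > 1:
            neighbors += [current - {r} for r in current]
        scored = [(fitness(points, labels, nb, beta), sorted(nb)) for nb in neighbors]
        score, candidate = min(scored)
        if score >= best:  # no neighbor improves the fitness: local optimum
            return current, best
        current, best = set(candidate), score


def sridhcr_sketch(points, labels, restarts=5, seed=0, beta=0.1):
    """Run the hill climber from several random starting sets; keep the best."""
    rng = random.Random(seed)
    runs = []
    for _ in range(restarts):
        size = rng.randint(1, max(1, len(points) // 2))
        start = rng.sample(range(len(points)), size)
        reps, score = hill_climb(points, labels, start, beta)
        runs.append((score, sorted(reps)))
    return min(runs)


# Toy data: class "a" forms two separate groups, class "b" forms one.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5), (0, 6), (1, 6)]
labels = ["a", "a", "a", "b", "b", "b", "a", "a"]
score, reps = sridhcr_sketch(points, labels)
print(score, reps)
```

Because class "a" forms two groups in the toy data, a solution with one representative per group can reach zero impurity at the cost of one extra cluster; how strongly that extra cluster is penalized is controlled by the assumed `beta`.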

2.2 Spatial Data Mining (SPDM)

- SPDM := the process of discovering interesting, useful, non-trivial patterns from (large) spatial datasets.
- Spatial patterns
  - Spatial outliers, discontinuities
    - bad traffic sensors on highways
  - Location prediction models
    - model to identify habitat of endangered species
  - Spatial clusters
    - crime hot-spots, poverty clusters
  - Co-location patterns
    - identify arsenic risk zones in Texas and determine if there is a correlation between the arsenic concentrations of the major Texas aquifers and cultural factors such as population, farm density, and the geology of the aquifers, etc.

Idea: Reuse the supervised clustering algorithms that already exist by running them with a different fitness function that corresponds to a particular measure of interestingness.
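One way to read this idea in code: the clustering search only needs a function that scores a candidate clustering, so the measure of interestingness can be swapped without touching the search itself. The sketch below scores a clustering as the sum over clusters of interestingness times cluster size raised to `beta`. This reward form, the two example interestingness measures, and all names are assumptions for illustration and are not taken from the slides.

```python
from collections import Counter
from statistics import pvariance


def purity_interest(cluster, threshold=0.7):
    """Interestingness for class labels: reward clusters dominated by one class."""
    top = Counter(cluster).most_common(1)[0][1] / len(cluster)
    return max(0.0, top - threshold)


def low_variance_interest(cluster, max_var=1.0):
    """Interestingness for numeric values: reward regions with similar values."""
    return max(0.0, 1.0 - pvariance(cluster) / max_var)


def reward(clustering, interest, beta=1.1):
    """Assumed reward-based fitness: sum of interest(c) * |c|**beta over clusters."""
    return sum(interest(c) * len(c) ** beta for c in clustering if c)


# Each cluster is represented by the attribute of interest of its members.
by_class = [["hot", "hot", "hot", "cold"], ["cold", "cold"], ["hot", "cold"]]
by_value = [[10.0, 10.5, 9.5], [2.0, 8.0]]

print(reward(by_class, purity_interest))
print(reward(by_value, low_variance_interest))
```

The same `reward` function is called with two different measures; a search procedure that takes `reward` as a parameter would not need to change when the measure does.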

Example: Discovery of “Interesting Regions” in Wyoming Census 2000 Datasets


2.3 Distance Function Learning

Example: How to Find Similar Patients?

Task: Construct a distance function that measures patient similarity

Motivation: Finding a “good” distance function is important for:

- Case-based reasoning
- Clustering
- Instance-based classification (e.g. nearest neighbor classifiers)

Our Approach: Learn distance functions based on training examples and user feedback

Motivating Example: How To Find Similar Patients?

The following relation is given (with 10000 tuples):

Patient(ssn, weight, height, cancer-sev, eye-color, age,…)

- Attribute Domains
  - ssn: 9 digits
  - weight: between 30 and 650; μ_weight=158, σ_weight=24.20
  - height: between 0.30 and 2.20 in meters; μ_height=1.52, σ_height=19.2
  - cancer-sev: 4=serious 3=quite_serious 2=medium 1=minor
  - eye-color: {brown, blue, green, grey}
  - age: between 3 and 100; μ_age=45, σ_age=13.2

Task: Define Patient Similarity
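A minimal sketch of one possible answer, assuming a weighted average of per-attribute distances: numeric attributes are scaled by the ranges given on the slide, the ordinal `cancer-sev` by its number of rank steps, and the nominal `eye-color` by a 0/1 mismatch. The `Patient` class, the function name, and the weights are hypothetical; the slides leave the actual definition open.

```python
from dataclasses import dataclass

# Attribute ranges taken from the slide; used to scale differences into [0, 1].
RANGES = {"weight": (30.0, 650.0), "height": (0.30, 2.20), "age": (3.0, 100.0)}
CANCER_SEV_STEPS = 4 - 1  # ordinal scale from 1=minor to 4=serious


@dataclass
class Patient:
    ssn: str
    weight: float
    height: float
    cancer_sev: int
    eye_color: str
    age: float


def patient_distance(a, b, weights):
    """Weighted average of per-attribute distances, each scaled to [0, 1]."""
    parts = {}
    for name, (low, high) in RANGES.items():
        parts[name] = abs(getattr(a, name) - getattr(b, name)) / (high - low)
    # Ordinal attribute: difference in rank, scaled by the number of steps.
    parts["cancer_sev"] = abs(a.cancer_sev - b.cancer_sev) / CANCER_SEV_STEPS
    # Nominal attribute: 0 if equal, 1 otherwise.
    parts["eye_color"] = 0.0 if a.eye_color == b.eye_color else 1.0
    # ssn is an identifier and says nothing about similarity, so it is skipped.
    total = sum(weights.values())
    return sum(weights[name] * parts[name] for name in weights) / total


# Hypothetical weights, chosen by hand for this example.
weights = {"weight": 1.0, "height": 1.0, "age": 1.0,
           "cancer_sev": 3.0, "eye_color": 0.2}
p1 = Patient("123456789", 158.0, 1.52, 4, "brown", 45.0)
p2 = Patient("987654321", 170.0, 1.60, 3, "blue", 50.0)
print(patient_distance(p1, p1, weights))
print(patient_distance(p1, p2, weights))
```

The hard part is the choice of weights: whether eye color should count at all, or cancer severity three times as much as age, depends on what "similar" is supposed to mean for the application.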

Idea: Coevolving Clusters and Distance Functions

[Figure: loop diagram with the components Distance Function Q, Clustering X, Clustering Evaluation q(X), Goodness of the Distance Function Q, and Weight Updating Scheme / Search Strategy. Example clusterings of "o" and "x" objects are shown for a "bad" distance function Q1 and a "good" distance function Q2.]

Distance Function Learning Framework

[Figure: framework diagram relating distance function evaluation methods (K-Means, Supervised Clustering, NN-Classifier, …) to weight-updating schemes / search strategies (Inside/Outside Weight Updating, Randomized Hill Climbing, Adaptive Clustering, …). Entries are marked as Current Research ([CHEN05], [ERBV04], [BECV05]), Work by Karypis, or Other Research.]
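To make one pairing from this framework concrete, the sketch below combines randomized hill climbing as the search strategy with a nearest-neighbor classifier as the evaluation. The weighted Manhattan distance, the leave-one-out accuracy score, the step size, and all names are assumptions for this example; it is not the procedure of any of the cited papers.

```python
import random


def weighted_distance(a, b, w):
    """Weighted Manhattan distance between two feature vectors."""
    return sum(wi * abs(x - y) for wi, x, y in zip(w, a, b))


def loo_accuracy(X, y, w):
    """Leave-one-out accuracy of a 1-nearest-neighbor classifier under weights w."""
    hits = 0
    for i in range(len(X)):
        others = [j for j in range(len(X)) if j != i]
        nearest = min(others, key=lambda j: weighted_distance(X[i], X[j], w))
        hits += y[nearest] == y[i]
    return hits / len(X)


def learn_weights(X, y, steps=200, step_size=0.2, seed=0):
    """Randomized hill climbing: perturb one weight, keep the change if it helps."""
    rng = random.Random(seed)
    w = [1.0 / len(X[0])] * len(X[0])
    best = loo_accuracy(X, y, w)
    for _ in range(steps):
        cand = list(w)
        k = rng.randrange(len(cand))
        cand[k] = max(0.0, cand[k] + rng.uniform(-step_size, step_size))
        total = sum(cand)
        if total == 0.0:
            continue  # skip the degenerate all-zero weight vector
        cand = [c / total for c in cand]  # keep weights normalized to sum to 1
        score = loo_accuracy(X, y, cand)
        if score > best:
            w, best = cand, score
    return w, best


# Toy data: the first feature separates the classes, the second is noise.
X = [(0.0, 5.0), (0.2, 1.0), (0.1, 9.0), (1.0, 5.2), (1.1, 0.8), (0.9, 9.1)]
y = ["o", "o", "o", "x", "x", "x"]
w, acc = learn_weights(X, y)
print(w, acc)
```

Swapping `loo_accuracy` for a clustering-based score would give a different cell of the same framework while leaving the search loop unchanged.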

2.4 Adaptive Data Mining

[Figure: diagram with the components Clustering Algorithm (Supervised Clustering), Inputs, Changes, Summary, Adaptation System, Evaluation System, Feedback, Domain Expert, Past Experience, Quality, and predefined Fitness Functions q(X), …]

2.5 Signatures of Data Sets

Input: a set of classified examples

Output: Signatures in the dataset that characterize

- how the examples of a class are distributed in the dataset (in relation to the examples of the other classes)
- how many regions dominated by a single class exist in the dataset
- which regions dominated by one class border regions dominated by another class
- where the regions identified in the two previous items are located
- what the density attractors (maxima of the density function) of the classes in the dataset are

Why are we creating those signatures?

- As a preprocessing step to develop smarter classifiers
- To understand why a particular data mining technique works well or does not work well for a particular dataset (meta learning)

Methods employed: density estimation techniques, supervised clustering, proximity graphs (e.g. Delaunay, Gabriel graphs),…
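As an illustration of how a proximity graph can expose bordering class regions, the sketch below builds a Gabriel graph by brute force and keeps the edges whose endpoints carry different class labels. Two points are Gabriel neighbors when no third point lies strictly inside the circle that has the two points as its diameter. The function names and the toy data are hypothetical, and the cubic-time construction is only suitable for small examples.

```python
from itertools import combinations


def sq_dist(p, q):
    """Squared Euclidean distance (avoids square roots and rounding issues)."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def gabriel_edges(points):
    """Edges (i, j) whose diametral circle has no other point strictly inside."""
    edges = []
    for i, j in combinations(range(len(points)), 2):
        d_ij = sq_dist(points[i], points[j])
        # k lies strictly inside the circle iff d(i,k)^2 + d(j,k)^2 < d(i,j)^2.
        empty = all(
            sq_dist(points[i], points[k]) + sq_dist(points[j], points[k]) >= d_ij
            for k in range(len(points)) if k not in (i, j)
        )
        if empty:
            edges.append((i, j))
    return edges


def border_edges(points, labels):
    """Gabriel edges that connect examples of different classes."""
    return [(i, j) for i, j in gabriel_edges(points) if labels[i] != labels[j]]


points = [(0, 0), (1, 0), (2, 0), (3, 0), (1.5, 2)]
labels = [1, 1, 2, 2, 2]
print(gabriel_edges(points))          # [(0, 1), (1, 2), (1, 4), (2, 3), (2, 4)]
print(border_edges(points, labels))   # [(1, 2), (1, 4)]
```

The mixed-class edges mark where a region dominated by class 1 touches a region dominated by class 2; points with no mixed-class edge lie in the interior of their class region.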

Example: Signatures of Data Sets

[Figure: the same three-panel scatter plot as in section 2.1 (a. Unsupervised Clustering, b. Semi-supervised Clustering, c. Supervised Clustering) over Attribute1 and Attribute2, with class 1, class 2, and unclassified objects and labels A to L]

Applications of Creating Signatures:

- Class Decomposition (see also [VAE03])

[Figure: three scatter plots over Attribute 1 and Attribute 2]

2.6 Research Christoph F. Eick 2005-2007

- Clustering for Classification
- Creating Signatures for Datasets
- Editing / Data Set Compression
- Supervised Clustering
- Distance Function Learning
- Spatial Data Mining
- Adaptive Clustering
- Mining Data Streams
- Online Data Mining
- Mining Sensor Data
- Measures of Interestingness
- Evolutionary Computing
- Mining Semi-Structured Data
- Web Annotation
- File Prediction

3. UH Data Mining and Machine Learning Group (UH-DMML)

Co-Directors: Christoph F. Eick and Ricardo Vilalta

Goal: Development of data analysis and data mining techniques and the application of these techniques to challenging problems in physics, geology, astronomy, environmental sciences, and medicine.

Topics investigated:

- Meta Learning
- Classification and Learning from Examples
- Clustering
- Distance Function Learning
- Using Reinforcement Learning for Data Mining
- Spatial Data Mining
- Knowledge Discovery
