Basic concepts of data mining clustering and genetic algorithms l.jpg
This presentation is the property of its rightful owner.
Sponsored Links
1 / 26

Basic concepts of Data Mining, Clustering and Genetic Algorithms PowerPoint PPT Presentation


  • 157 Views
  • Uploaded on
  • Presentation posted in: General

Basic concepts of Data Mining, Clustering and Genetic Algorithms. Tsai-Yang Jea Department of Computer Science and Engineering SUNY at Buffalo. Data Mining Motivation. Mechanical production of data need for mechanical consumption of data Large databases = vast amounts of information

Download Presentation

Basic concepts of Data Mining, Clustering and Genetic Algorithms

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript


Basic concepts of data mining clustering and genetic algorithms l.jpg

Basic concepts of Data Mining, Clustering and Genetic Algorithms

Tsai-Yang Jea

Department of

Computer Science and Engineering

SUNY at Buffalo


Data mining motivation l.jpg

Data Mining Motivation

  • Mechanical production of data need for mechanical consumption of data

  • Large databases = vast amounts of information

  • Difficulty lies in accessing it


Kdd and data mining l.jpg

KDD and Data Mining

  • KDD: Extraction of knowledge from data

    • “non-trivial extraction of implicit, previously unknown & potentially useful knowledge from data”

  • Data Mining: Discovery stage of the KDD process


Data mining techniques l.jpg

Query tools

Statistical techniques

Visualization

On-line analytical processing (OLAP)

Clustering

Classification

Decision trees

Association rules

Neural networks

Genetic algorithms

Data Mining Techniques

Any technique that helps to extract more out of data is useful


What s clustering l.jpg

What’s Clustering

  • Clustering is a kind of unsupervised learning.

  • Clustering is a method of grouping data that share similar trend and patterns.

  • Clustering of data is a method by which large sets of data is grouped into clusters of smaller sets of similar data.

    • Example:

After clustering:

Thus, we see clustering means grouping of data or dividing a large

data set into smaller data sets of some similarity.


The usage of clustering l.jpg

The usage of clustering

  • Some engineering sciences such as pattern recognition, artificial intelligence have been using the concepts of cluster analysis. Typical examples to which clustering has been applied include handwritten characters, samples of speech, fingerprints, and pictures.

  • In the life sciences (biology, botany, zoology, entomology, cytology, microbiology), the objects of analysis are life forms such as plants, animals, and insects. The clustering analysis may range from developing complete taxonomies to classification of the species into subspecies. The subspecies can be further classified into subspecies.

  • Clustering analysis is also widely used in information, policy and decision sciences. The various applications of clustering analysis to documents include votes on political issues, survey of markets, survey of products, survey of sales programs, and R & D.


A clustering example l.jpg

A Clustering Example

Income: High

Children:1

Car:Luxury

Income: Medium

Children:2

Car:Truck

Cluster 1

Car: Sedan and

Children:3

Income: Medium

Income: Low

Children:0

Car:Compact

Cluster 4

Cluster 3

Cluster 2


Different ways of representing clusters l.jpg

(a)

d

d

e

a

a

c

j

j

k

k

h

h

b

f

i

g

(d)

(c)

1

2

3

a

0.4

0.1

0.5

b

0.1

0.8

0.1

g

a

c

i

e

d

k

b

j

f

h

c

0.3

0.3

0.4

...

Different ways of representing clusters

(b)

e

c

b

f

i

g


K means clustering iterative distance based clustering l.jpg

K Means Clustering(Iterative distance-based clustering)

  • K means clustering is an effective algorithm to extract a given number of clusters of patterns from a training set. Once done, the cluster locations can be used to classify patterns into distinct classes.


K means clustering cont l.jpg

K means clustering(Cont.)

Select the k cluster centers randomly.

Loop until the

change in cluster

means is less the

amount specified

by the user.

Store the k cluster centers.


The drawbacks of k means clustering l.jpg

The drawbacks of K-means clustering

  • The final clusters do not represent a global optimization result but only the local one, and complete different final clusters can arise from difference in the initial randomly chosen cluster centers. (fig. 1)

  • We have to know how many clusters we will have at the first.


Drawback of k means clustering cont l.jpg

Drawback of K-means clustering(Cont.)

Figure 1


Clustering with genetic algorithm l.jpg

Clustering with Genetic Algorithm

  • Introduction of Genetic Algorithm

  • Elements consisting GAs

  • Genetic Representation

  • Genetic operators


Introduction of gas l.jpg

Introduction of GAs

  • Inspired by biological evolution.

  • Many operators mimic the process of the biological evolution including

    • Natural selection

    • Crossover

    • Mutation


Elements consisting gas l.jpg

Elements consisting GAs

  • Individual (chromosome):

    • feasible solution in an optimization problem

  • Population

    • Set of individuals

    • Should be maintained in each generation


Elements consisting gas16 l.jpg

Elements consisting GAs

  • Genetic operators. (crossover, mutation…)

  • Define the fitness function.

    • The fitness function takes a single chromosome as input and returns a measure of the goodness of the solution represented by the chromosome.


Genetic representation l.jpg

Genetic Representation

  • The most important starting point to develop a genetic algorithm

  • Each gene has its special meaning

  • Based on this representation, we can define

    • fitness evaluation function,

    • crossover operator,

    • mutation operator.


Genetic representation cont l.jpg

Wind

PlayTennis

Outlook

If Outlook is

Overcast or Rain

and

Wind is Strong,

then

PlayTennis = Yes

0

1

1

1

0

1

0

Sunny

Overcast

Strong

Yes

Rain

Normal

No

0

1

0

0

1

1

1

A chromosome

Genetic Representation (Cont.)

  • Examples 1

Gene

Allele value


Genetic representation cont19 l.jpg

Genetic Representation (Cont.)

  • Examples 2 ( In clustering problem)

    • Each chromosome represents a set of clusters; each gene represents an object; each allele value represents a cluster. Genes with the same allele value are in the same cluster.

3

5

5

1

2

1

4

A

B

C

D

E

F

G


Crossover l.jpg

Crossover

  • Exchange features of two individuals to produce two offspring (children)

  • Selected mates may have good properties to survive in next generations

  • So, we can expect that exchanging features may produce other good individuals


Crossover cont l.jpg

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

1

1

1

1

1

0

0

1

1

0

1

0

0

1

1

Crossover (cont.)

  • Single-point Crossover

  • Two-point Crossover

  • Uniform Crossover

1

1

1

0

1

0

0

1

0

0

0

1

1

1

0

1

0

1

0

1

0

1

0

1

0

1

0

1

0

0

1

0

0

0

1

1

1

0

1

0

0

1

0

0

0

1

1

0

0

1

0

1

1

0

0

0

0

1

0

1

0

1

0

0

1

0

1

0

0

0

1

0

1

Crossover template

1

1

1

0

1

0

0

1

0

0

0

1

0

0

0

1

0

0

0

1

0

0

0

1

0

1

0

1

0

1

1

0

1

0

1

1

0

0

1


Mutation l.jpg

0

0

1

1

1

1

0

1

1

1

Mutation

  • Usually change a single bit in a bit string

  • This operator should happen with very low probability.

Mutation point

(random)


Typical procedures l.jpg

1

0

0

0

1

1

1

0

1

1

0

0

0

1

1

0

1

1

1

1

0

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

0

0

1

1

0

0

1

1

0

1

0

0

0

0

0

0

0

0

1

0

0

0

1

0

0

1

1

1

1

1

1

1

0

1

1

0

1

1

1

1

Typical Procedures

  • Crossover mates are probabilistically selected based on their fitness value.

Mutation point

(random)

Crossover point

randomly selected

oldgeneration

Probabilistically select individuals

newgeneration


How to apply ga on a clustering problem l.jpg

1

1

1

3

1

2

3

2

3

2

3

3

3

3

4

3

3

3

4

3

5

5

5

5

5

How to apply GA on a clustering problem

  • Preparing the chromosomes

  • Defining genetic operators

    • Fusion: takes two unique allele values and combines them into a single allele value, combining two clusters into one.

    • Fission: takes a single allele value and gives it a different random allele value, breaking a cluster apart.

  • Defining fitness functions


Example cont l.jpg

2

1

2

1

2

1

2

1

2

2

2

2

1

4

1

2

2

1

2

2

3

3

3

3

3

3

3

2

3

3

4

4

3

4

3

3

2

3

5

1

5

5

5

5

5

5

5

4

5

5

Example: (Cont.)

Old generation

Crossover

Mutation

Fusion

Fission

Select the chromosomes

according to the fitness

function.

New generation


Finally l.jpg

Finally…

Thank You


  • Login