Genetic algorithms (GA) for clustering

Speech and Image Processing Unit, School of Computing

University of Eastern Finland

Genetic algorithms (GA) for clustering

Clustering Methods: Part 2e

Pasi Fränti

General structure

Genetic Algorithm:
  Generate S initial solutions
  REPEAT Z iterations
    Select best solutions
    Create new solutions by crossover
    Mutate solutions
  END-REPEAT
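As a rough Python sketch of this loop (the solution representation, fitness, crossover and mutation are left abstract; all names and default parameter values here are illustrative, not from the original slides):

```python
import random

def genetic_algorithm(generate, fitness, crossover, mutate,
                      S=45, Z=50, survivors=9):
    """Generic GA skeleton following the pseudocode above.

    generate()      -> a random initial solution
    fitness(sol)    -> lower is better (e.g. MSE distortion)
    crossover(a, b) -> new solution created from two parents
    mutate(sol)     -> possibly perturbed copy of sol
    """
    population = [generate() for _ in range(S)]          # S initial solutions
    for _ in range(Z):                                   # REPEAT Z iterations
        population.sort(key=fitness)                     # select best solutions
        best = population[:survivors]
        children = [crossover(*random.sample(best, 2))   # new solutions
                    for _ in range(S - survivors)]       #   by crossover
        population = best + [mutate(c) for c in children]  # mutate solutions
    return min(population, key=fitness)
```
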

Components of GA
• Representation of solution
• Selection method
• Crossover method (most critical!)
• Mutation

Representation of solution
• Partition (P):
• Optimal centroid can be calculated from P.
• Only local changes can be made.
• Codebook (C):
• Optimal partition can be calculated from C.
• Calculation of P takes O(NM) → slow.
• Combined (C, P):
• Both data structures are needed anyway.
• Computationally more efficient.
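The two conversions behind these trade-offs can be sketched as follows (a minimal illustration; `X` is the data set as a list of points, and all names are assumptions of this sketch):

```python
def centroids_from_partition(X, P, M):
    """Optimal centroid of each cluster = mean of its assigned points (O(N))."""
    sums = [[0.0] * len(X[0]) for _ in range(M)]
    counts = [0] * M
    for x, p in zip(X, P):
        counts[p] += 1
        for d, v in enumerate(x):
            sums[p][d] += v
    return [[s / max(c, 1) for s in row] for row, c in zip(sums, counts)]

def partition_from_centroids(X, C):
    """Optimal partition: nearest centroid for each point -- the O(NM) step."""
    def dist2(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    return [min(range(len(C)), key=lambda j: dist2(x, C[j])) for x in X]
```
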
Selection method
• To select which solutions will be used in crossover for generating new solutions.
• Main principle: good solutions should be used rather than weak solutions.
• Two main strategies:
• Roulette wheel selection
• Elitist selection.
• Exact implementation not so important.
Roulette wheel selection
• Select two candidate solutions for the crossover randomly.
• Probability for a solution to be selected is weighted according to its distortion: the lower the distortion, the higher the selection probability.
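A sketch of this selection, assuming the common inverse-distortion weighting (one of several possible weightings; the slide does not fix the exact formula):

```python
import random

def roulette_select(solutions, distortions):
    """Pick one solution with probability proportional to 1/distortion,
    so lower-distortion solutions are selected more often."""
    weights = [1.0 / d for d in distortions]
    total = sum(weights)
    r = random.uniform(0, total)          # spin the wheel
    acc = 0.0
    for sol, w in zip(solutions, weights):
        acc += w
        if r <= acc:
            return sol
    return solutions[-1]                  # guard against rounding
```
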
Elitist selection
• Main principle: select all possible pairs among the best candidates.

Elitist approach using zigzag scanning among the best solutions.
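One way to sketch this pair generation (the exact zigzag order is an assumption of this sketch; the slides note that the exact implementation is not so important):

```python
def elitist_pairs(ranked):
    """All parent pairs among the best candidates, ordered so that
    pairs of top-ranked solutions come first (zigzag over rank sums)."""
    n = len(ranked)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    pairs.sort(key=lambda p: (p[0] + p[1], p[1]))   # zigzag-like order
    return [(ranked[i], ranked[j]) for i, j in pairs]
```
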

Crossover methods

Different variants for crossover:

• Random crossover
• Centroid distance
• Pairwise crossover
• Largest partitions
• PNN

Local fine-tuning:

• All methods give a new allocation of the centroids.
• Local fine-tuning must then be made by K-means.
• Two iterations of K-means are enough.
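The fine-tuning step might look like this (a self-contained sketch of a few k-means iterations applied to a child solution):

```python
def kmeans_tune(X, C, iterations=2):
    """Fine-tune centroids C on data X with a few k-means iterations.

    Two iterations are typically enough after a crossover, since the
    child solution is already close to a local optimum.
    """
    def dist2(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    P = None
    for _ in range(iterations):
        # Assignment step: nearest centroid for each point (O(NM)).
        P = [min(range(len(C)), key=lambda j: dist2(x, C[j])) for x in X]
        # Update step: each centroid moves to the mean of its points.
        new_C = []
        for j in range(len(C)):
            members = [x for x, p in zip(X, P) if p == j]
            if members:
                new_C.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_C.append(C[j])          # keep empty clusters in place
        C = new_C
    return C, P
```
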
Random crossover

Select M/2 centroids randomly from each of the two parents.

[Figure: centroids c1–c4 of solution 1 and solution 2 combined into a new solution]

[Figure legend: data points and centroids; M = number of clusters; parent solutions A and B; new solution]

How to create a new solution?

Pick M/2 randomly chosen cluster centroids from each of the two parents.

How many solutions are there?

There are 36 possible ways to create a new solution.

What is the probability to select a good one?

Not high: some are good (but K-means is still needed), most are bad. See the statistics below.

Example with M = 4. Rough statistics of the possible combinations:

• Optimal: 1
• Good: 7
• Bad: the rest
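A sketch of random crossover under these assumptions (centroids as plain lists; which parent contributes which centroids is entirely random):

```python
import random

def random_crossover(parent_a, parent_b, M):
    """Random crossover: take M/2 centroids from each parent.

    Simple, but most of the possible combinations are poor, so
    k-means fine-tuning of the child is essential.
    """
    half = M // 2
    return random.sample(parent_a, half) + random.sample(parent_b, M - half)
```
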

[Figure: parent solutions A and B with centroids c1–c4, and two child solutions: one optimal, one merely good]
Centroid distance crossover [Pan, McInnes, Jack, 1995: Electronics Letters] [Scheunders, 1996: Pattern Recognition Letters]
• For each centroid, calculate its distance to the center point of the entire data set.
• Sort the centroids according to the distance.
• Divide into two sets: central vectors (M/2 closest) and distant vectors (M/2 furthest).
• Take central vectors from one codebook and distant vectors from the other.
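A sketch of the four steps above (centroids as plain lists; `X` is the full data set, whose mean serves as the center point; variant names follow the slides):

```python
def centroid_distance_crossover(X, A, B, variant='a'):
    """Centroid distance crossover sketch.

    Sorts each parent's centroids by distance to the center point of
    the entire data set, then combines the central half (M/2 closest)
    of one parent with the distant half (M/2 furthest) of the other.
    """
    def dist2(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    center = [sum(col) / len(X) for col in zip(*X)]   # center of data set
    M = len(A)
    sa = sorted(A, key=lambda c: dist2(c, center))    # central vectors first
    sb = sorted(B, key=lambda c: dist2(c, center))
    if variant == 'a':
        return sa[:M // 2] + sb[M // 2:]   # central of A + distant of B
    return sb[:M // 2] + sa[M // 2:]       # central of B + distant of A
```
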

[Figure: parent solutions A and B, their centroids c1–c4, and the centroid Ced of the entire data set]

1) Distances d(ci, Ced):

A: d(c4, Ced) < d(c2, Ced) < d(c1, Ced) < d(c3, Ced)
B: d(c1, Ced) < d(c3, Ced) < d(c2, Ced) < d(c4, Ced)

2) Sort centroids according to the distance:

A: c4, c2, c1, c3
B: c1, c3, c2, c4

3) Divide into two sets (M = 4):

A: central vectors c4, c2; distant vectors c1, c3
B: central vectors c1, c3; distant vectors c2, c4


New solution:

Variant (a): take central vectors from parent solution A and distant vectors from parent solution B,

OR

Variant (b): take distant vectors from parent solution A and central vectors from parent solution B.

[Figure: child solutions for variant (a) and variant (b)]

Pairwise crossover [Fränti et al, 1997: Computer Journal]

Greedy approach:

• For each centroid, find its nearest centroid in the other parent solution that is not yet used.
• From each pair, select one of the two centroids randomly.

Small improvement:

• No reason to consider the parents as separate solutions.
• Take union of all centroids.
• Make the pairing independent of parent.
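A sketch of the greedy pairing described above (pair each centroid of parent A with its nearest unused centroid of parent B, then keep one centroid of each pair at random; the parent-independent union variant is omitted here):

```python
import random

def pairwise_crossover(A, B):
    """Pairwise crossover sketch: greedy nearest-mate pairing."""
    def dist2(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    unused = list(range(len(B)))
    child = []
    for ca in A:
        j = min(unused, key=lambda k: dist2(ca, B[k]))  # nearest unused mate
        unused.remove(j)
        child.append(random.choice([ca, B[j]]))         # keep one of the pair
    return child
```
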
Pairwise crossover example

Initial parent solutions

MSE=11.92109

MSE=8.79109

Pairwise crossover example

Pairing between parent solutions

MSE=7.34109

Pairwise crossover example

Pairing without restrictions

MSE=4.76109

Largest partitions [Fränti et al, 1997: Computer Journal]
• Select centroids that represent the largest clusters.
• Selection is done in a greedy manner.
• (illustration to appear later)
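A simplified sketch of this idea, assuming the caller supplies each centroid's cluster size (the full method selects greedily and handles duplicate regions; here only the size ranking is shown):

```python
def largest_partitions_crossover(A, B, sizes_a, sizes_b, M):
    """Largest-partitions sketch: from the union of both parents'
    centroids, keep the M centroids whose clusters hold the most points."""
    pool = list(zip(A + B, sizes_a + sizes_b))
    pool.sort(key=lambda cs: -cs[1])        # biggest clusters first
    return [c for c, _ in pool[:M]]
```
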

[Figure: PNN crossover — Initial 1 and Initial 2 → Union → Combined → After PNN]

Mutations
• Purpose is to implement small random changes to the solutions.
• Happens with a small probability.
• Sensible approach: change the location of one centroid by the random swap!
• Role of mutations is to simulate local search.
• If mutations are needed → the crossover method is not very good.
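A random-swap mutation can be sketched as follows (replace one randomly chosen centroid by a randomly chosen data point; names are illustrative):

```python
import random

def random_swap_mutation(C, X):
    """Random swap: move one randomly chosen centroid onto a
    randomly chosen data point, leaving the parent solution intact."""
    C = [list(c) for c in C]                      # copy the codebook
    C[random.randrange(len(C))] = list(random.choice(X))
    return C
```
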
Effect of k-means and mutations

[Plot: k-means improves the results but is not vital; mutations alone are better than random crossover!]

PNN vs. IS crossovers

[Plot: further improvement of about 1%]
Optimized GAIS variants

GAIS short (optimized for speed):

• Create new generations only as long as the best solution keeps improving (T=*).
• Use a small population size (Z=10).
• Apply two iterations of k-means (G=2).

GAIS long (optimized for quality):

• Create a large number of generations (T=100).
• Use a large population size (Z=100).
• Iterate k-means relatively long (G=10).
Conclusions
• Best clustering obtained by GA.
• Crossover method most important.
• Mutations not needed.
References
• P. Fränti and O. Virmajoki, "Iterative shrinking method for clustering problems", Pattern Recognition, 39 (5), 761-765, May 2006.
• P. Fränti, "Genetic algorithm with deterministic crossover for vector quantization", Pattern Recognition Letters, 21 (1), 61-68, January 2000.
• P. Fränti, J. Kivijärvi, T. Kaukoranta and O. Nevalainen, "Genetic algorithms for large scale clustering problems", The Computer Journal, 40 (9), 547-554, 1997.
• J. Kivijärvi, P. Fränti and O. Nevalainen, "Self-adaptive genetic algorithm for clustering", Journal of Heuristics, 9 (2), 113-129, 2003.
• J.S. Pan, F.R. McInnes and M.A. Jack, "VQ codebook design using genetic algorithms", Electronics Letters, 31, 1418-1419, August 1995.
• P. Scheunders, "A genetic Lloyd-Max quantization algorithm", Pattern Recognition Letters, 17, 547-556, 1996.