
Speech and Image Processing Unit, School of Computing

University of Eastern Finland

Genetic algorithms (GA) for clustering

Clustering Methods: Part 2e

Pasi Fränti

General structure

Genetic Algorithm:

Generate S initial solutions

REPEAT Z iterations

Select best solutions

Create new solutions by crossover

Mutate solutions

END-REPEAT
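The loop above can be sketched as a toy Python implementation. The helpers here (squared-error distortion, random crossover, random-swap mutation, elitism, 2-D points) are my own minimal choices for illustration, not the presentation's exact methods:

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def distortion(data, C):
    # Sum of squared distances from each point to its nearest centroid.
    return sum(min(sq_dist(x, c) for c in C) for x in data)

def ga_clustering(data, M, S=8, Z=15, seed=0):
    rng = random.Random(seed)
    # Generate S initial solutions (centroids sampled from the data).
    population = [rng.sample(data, M) for _ in range(S)]
    for _ in range(Z):                            # REPEAT Z iterations
        population.sort(key=lambda C: distortion(data, C))
        elite = population[:S // 2]               # select best solutions
        children = []
        for _ in range(S):
            a, b = rng.sample(elite, 2)           # crossover two parents:
            child = rng.sample(a, M // 2) + rng.sample(b, M - M // 2)
            if rng.random() < 0.1:                # mutate with small prob.
                child[rng.randrange(M)] = rng.choice(data)
            children.append(child)
        children[0] = elite[0]                    # keep the best (elitism)
        population = children
    return min(population, key=lambda C: distortion(data, C))

data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
best = ga_clustering(data, M=2)
```

On this tiny two-cluster data set the GA reliably ends with one centroid in each cluster, which is the point of the selection step.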

Representation of solution

- Partition (P):
- Optimal centroid can be calculated from P.
- Only local changes can be made.
- Codebook (C):
- Optimal partition can be calculated from C.
- Calculation of P takes O(NM) time, which is slow.
- Combined (C, P):
- Both data structures are needed anyway.
- Computationally more efficient.
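The two conversions between the representations can be written out directly. A minimal sketch for 2-D points; `optimal_partition` is the O(NM) step mentioned above, and the function names are mine, not the presentation's:

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def optimal_partition(data, C):
    # P from C: map each point to its nearest centroid -- O(N*M).
    return [min(range(len(C)), key=lambda j: sq_dist(x, C[j])) for x in data]

def optimal_centroids(data, P, M):
    # C from P: each centroid becomes the mean of its cluster's points.
    clusters = [[x for x, j in zip(data, P) if j == m] for m in range(M)]
    return [tuple(sum(v) / len(pts) for v in zip(*pts)) if pts else None
            for pts in clusters]

data = [(0, 0), (0, 2), (10, 0), (10, 2)]
P = optimal_partition(data, [(1, 1), (9, 1)])
C = optimal_centroids(data, P, 2)
```

Keeping both structures (the combined representation) means neither conversion has to be recomputed from scratch inside the GA loop.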

Selection method

- Selects which solutions are used in crossover for generating new solutions.
- Main principle: good solutions should be used rather than weak ones.
- Two main strategies:
- Roulette wheel selection
- Elitist selection
- The exact implementation is not critical.

Roulette wheel selection

- Select two candidate solutions for the crossover randomly.
- The probability of a solution being selected is weighted according to its distortion: the lower the distortion, the higher the probability.
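A sketch of roulette wheel selection. The slide does not reproduce the exact weighting formula, so this assumes inverse-distortion weights; the function names are mine:

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def distortion(data, C):
    return sum(min(sq_dist(x, c) for c in C) for x in data)

def roulette_select(data, population, rng):
    # Weight each solution by 1/distortion: lower distortion gets a
    # larger slice of the wheel (assumed weighting, not from the slide).
    weights = [1.0 / distortion(data, C) for C in population]
    return rng.choices(population, weights=weights, k=1)[0]

rng = random.Random(1)
data = [(0, 0), (0, 1), (10, 10), (10, 11)]
good = [(0, 0.5), (10, 10.5)]     # low distortion
bad = [(5, 5), (6, 6)]            # high distortion
picks = [roulette_select(data, [good, bad], rng) for _ in range(1000)]
```

With these weights the good solution wins the overwhelming majority of the 1000 draws, while the weak one still has a nonzero chance.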

Elitist selection

- Main principle: select all possible pairs among the best candidates.
- Elitist approach: zigzag scanning among the best solutions.
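The zigzag scanning can be read as enumerating every pair of the top-k candidates, pairing each newly admitted solution back against all better-ranked ones. The exact order below is my reading of the slide's figure, not a specification:

```python
from itertools import combinations

def zigzag_pairs(k):
    # Pairs (i, j) of the k best solutions (index 0 = best): each new
    # solution j is paired against all better-ranked ones before moving on.
    return [(i, j) for j in range(1, k) for i in range(j)]

pairs = zigzag_pairs(4)
```

For k = 4 this yields (0,1), (0,2), (1,2), (0,3), (1,3), (2,3): all C(4,2) = 6 pairs, best-ranked pairs first.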

Crossover methods

Different variants for crossover:

- Random crossover
- Centroid distance
- Pairwise crossover
- Largest partitions
- PNN

Local fine-tuning:

- All methods produce a new allocation of the centroids.
- Local fine-tuning must then be done by K-means.
- Two iterations of K-means are enough.
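The fine-tuning step can be sketched as two plain k-means iterations over 2-D points (function name and data are mine):

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_tune(data, C, iterations=2):
    # Two iterations are enough according to the slides.
    for _ in range(iterations):
        # Optimal partition from the current centroids (O(N*M))...
        P = [min(range(len(C)), key=lambda j: sq_dist(x, C[j])) for x in data]
        # ...then move each centroid to the mean of its assigned points.
        new_C = []
        for m, c in enumerate(C):
            pts = [x for x, j in zip(data, P) if j == m]
            if pts:
                c = tuple(sum(v) / len(pts) for v in zip(*pts))
            new_C.append(c)        # empty cluster: keep the old centroid
        C = new_C
    return C

data = [(0, 0), (0, 2), (10, 0), (10, 2)]
C = kmeans_tune(data, [(1, 1), (9, 1)])
```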

[Figure: random crossover — parent solutions A and B, each with M = 4 centroids c1–c4. Legend: data point, centroid; M = number of clusters.]

How to create a new solution?

Pick M/2 randomly chosen cluster centroids from each of the two parents in turn.

How many solutions are there?

There are 36 possible ways to create a new solution.

What is the probability of selecting a good one?

Not high: a few are good (but still need K-means), and most are bad. See the statistics.

Rough statistics for M = 4 (36 possible children):

- Optimal: 1
- Good: 7
- Bad: 28
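The count of 36 follows from choosing which 2 of each parent's 4 centroids survive: C(4,2) × C(4,2) = 6 × 6 = 36. A sketch of the crossover itself (function name mine):

```python
import math
import random

def random_crossover(A, B, rng):
    # Pick M/2 centroids from each parent (M assumed even here).
    M = len(A)
    return rng.sample(A, M // 2) + rng.sample(B, M - M // 2)

# For M = 4: each parent contributes 2 of its 4 centroids.
n_children = math.comb(4, 2) ** 2      # 6 * 6 = 36 possibilities

rng = random.Random(0)
A = [(0, 0), (0, 1), (1, 0), (1, 1)]
B = [(9, 9), (9, 10), (10, 9), (10, 10)]
child = random_crossover(A, B, rng)
```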

[Figure: example child solutions for M = 4 — optimal, good, and bad combinations of the parents' centroids.]

Centroid distance crossover [Pan, McInnes, Jack, 1995: Electronics Letters] [Scheunders, 1996: Pattern Recognition Letters]

- For each centroid, calculate its distance to the center point of the entire data set.
- Sort the centroids according to the distance.
- Divide into two sets: central vectors (M/2 closest) and distant vectors (M/2 furthest).
- Take central vectors from one codebook and distant vectors from the other.
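The four steps can be sketched directly; this shows variant (a) of the combination (central half from A, distant half from B), and the names are mine:

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid_distance_crossover(A, B, data):
    # 1) Center point (Ced) of the entire data set.
    ced = tuple(sum(v) / len(data) for v in zip(*data))
    # 2) Sort both parents' centroids by distance to Ced.
    a_sorted = sorted(A, key=lambda c: sq_dist(c, ced))
    b_sorted = sorted(B, key=lambda c: sq_dist(c, ced))
    # 3)-4) Central vectors (M/2 closest) from A, distant vectors from B.
    M = len(A)
    return a_sorted[:M // 2] + b_sorted[M // 2:]

data = [(-1, 0), (1, 0), (0, -1), (0, 1)]           # Ced = (0, 0)
A = [(0.1, 0), (5, 5), (0, 0.2), (6, -6)]
B = [(3, 3), (0.5, 0), (-4, 4), (0, 0.5)]
child = centroid_distance_crossover(A, B, data)
```

Here A contributes its two centroids nearest Ced, and B its two furthest, so the child is [(0.1, 0), (0, 0.2), (3, 3), (-4, 4)].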

[Figure: centroid distance crossover — parent solutions A and B with centroids c1–c4 and the data set centroid Ced.]

1) Distances d(ci, Ced):

A: d(c4, Ced) < d(c2, Ced) < d(c1, Ced) < d(c3, Ced)
B: d(c1, Ced) < d(c3, Ced) < d(c2, Ced) < d(c4, Ced)

2) Sort centroids according to the distance:

A: c4, c2, c1, c3
B: c1, c3, c2, c4

3) Divide into two sets (M = 4):

A: central vectors c4, c2; distant vectors c1, c3
B: central vectors c1, c3; distant vectors c2, c4


New solution:

Variant (a): take the central vectors from parent solution A and the distant vectors from parent solution B,

OR

Variant (b): take the distant vectors from parent solution A and the central vectors from parent solution B.

[Figure: child solutions from centroid distance crossover — variant (a) and variant (b).]


Pairwise crossover [Fränti et al., 1997: The Computer Journal]

Greedy approach:

- For each centroid, find its nearest centroid in the other parent solution that is not yet used.
- Among all pairs, select one of the two randomly.

Small improvement:

- There is no reason to treat the parents as separate solutions.
- Take the union of all centroids.
- Make the pairing independent of the parent.
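A sketch of the greedy pairing (the union-based improvement is omitted for brevity; names are mine):

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def pairwise_crossover(A, B, rng):
    used, child = set(), []
    for a in A:
        # Nearest not-yet-used centroid of the other parent.
        j = min((j for j in range(len(B)) if j not in used),
                key=lambda j: sq_dist(a, B[j]))
        used.add(j)
        # Among the pair, keep one of the two at random.
        child.append(a if rng.random() < 0.5 else B[j])
    return child

rng = random.Random(0)
A = [(0, 0), (10, 10)]
B = [(0, 1), (10, 11)]
child = pairwise_crossover(A, B, rng)
```

Each pair matches a region of the data space, so the child keeps one representative per region instead of possibly doubling up the way random crossover can.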

Largest partitions [Fränti et al., 1997: The Computer Journal]

- Select the centroids that represent the largest clusters.
- Selection is done in a greedy manner.

PNN crossover for GA [Fränti et al., 1997: The Computer Journal]

[Figure: PNN crossover stages — Initial 1, Initial 2, Union, Combined, After PNN.]

The PNN crossover method (1) [Fränti, 2000: Pattern Recognition Letters]
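The PNN crossover can be sketched as: take the union of the parents' centroids (2M vectors), then repeatedly merge the closest pair until M remain. This is a hedged simplification — the real PNN merge cost is size-weighted, while plain midpoints are used here for brevity:

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def pnn_crossover(A, B):
    C = list(A) + list(B)          # union: 2M centroids
    M = len(A)
    while len(C) > M:              # merge the nearest pair until M remain
        i, j = min(((i, j) for i in range(len(C))
                    for j in range(i + 1, len(C))),
                   key=lambda p: sq_dist(C[p[0]], C[p[1]]))
        merged = tuple((u + v) / 2 for u, v in zip(C[i], C[j]))
        C = [c for k, c in enumerate(C) if k not in (i, j)] + [merged]
    return C

A = [(0, 0), (10, 10)]
B = [(0.2, 0), (10, 10.2)]
C = pnn_crossover(A, B)
```

Because the merges are deterministic, two similar parents always yield the same child: here the near-duplicate centroids collapse to roughly (0.1, 0) and (10, 10.1).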

Effect of crossover method (with k-means iterations)

[Chart: comparison of the crossover methods on binary data (Bridge2).]

Mutations

- Purpose is to implement small random changes to the solutions.
- Happens with a small probability.
- Sensible approach: change the location of one centroid by the random swap!
- Role of mutations is to simulate local search.
- If mutations are needed crossover method is not very good.
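The random swap mutation can be sketched as follows; the probability value is illustrative and the function name is mine:

```python
import random

def random_swap_mutation(C, data, rng, p=0.05):
    # With small probability p, move one randomly chosen centroid
    # onto a randomly chosen data point.
    C = list(C)
    if rng.random() < p:
        C[rng.randrange(len(C))] = rng.choice(data)
    return C

rng = random.Random(0)
data = [(1, 1), (2, 2), (3, 3)]
mutated = random_swap_mutation([(0, 0), (5, 5)], data, rng, p=1.0)
```

With `p=1.0` forced for demonstration, exactly one of the two centroids is relocated onto a data point.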

Effect of k-means and mutations

K-means improves the result but is not vital.

Mutations alone work better than random crossover!

PNN vs. IS crossovers

Further improvement of about 1%

Optimized GAIS variants

GAIS short (optimized for speed):

- Create new generations only as long as the best solution keeps improving (T=*).
- Use a small population size (Z=10).
- Apply two iterations of k-means (G=2).

GAIS long (optimized for quality):

- Create a large number of generations (T=100).
- Use a large population size (Z=100).
- Iterate k-means relatively long (G=10).

Conclusions

- Best clustering obtained by GA.
- Crossover method most important.
- Mutations not needed.

References

- P. Fränti and O. Virmajoki, "Iterative shrinking method for clustering problems", Pattern Recognition, 39 (5), 761-765, May 2006.
- P. Fränti, "Genetic algorithm with deterministic crossover for vector quantization", Pattern Recognition Letters, 21 (1), 61-68, January 2000.
- P. Fränti, J. Kivijärvi, T. Kaukoranta and O. Nevalainen, "Genetic algorithms for large scale clustering problems", The Computer Journal, 40 (9), 547-554, 1997.
- J. Kivijärvi, P. Fränti and O. Nevalainen, "Self-adaptive genetic algorithm for clustering", Journal of Heuristics, 9 (2), 113-129, 2003.
- J.S. Pan, F.R. McInnes and M.A. Jack, "VQ codebook design using genetic algorithms", Electronics Letters, 31, 1418-1419, August 1995.
- P. Scheunders, "A genetic Lloyd-Max quantization algorithm", Pattern Recognition Letters, 17, 547-556, 1996.
