Software clustering using bunch
This presentation is the property of its rightful owner.
Sponsored Links
1 / 76

Software Clustering Using Bunch PowerPoint PPT Presentation


  • 62 Views
  • Uploaded on
  • Presentation posted in: General

Software Clustering Using Bunch. Software Clustering Environment. Source Code. Source. Query. Source. Analyzer. Code. Script. Code. (e.g., Chava). Database. (e.g., AWK). Module. Dependency. Graph. Graph Layout &. Bunch. Clustered. Visualization. Output. Clustering. Graph.

Download Presentation

Software Clustering Using Bunch

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript


Software clustering using bunch

Software ClusteringUsing Bunch

Reverse Engineering (Bunch)


Software clustering environment

Software Clustering Environment

Source Code

Source

Query

Source

Analyzer

Code

Script

Code

(e.g., Chava)

Database

(e.g., AWK)

Module

Dependency

Graph

Graph Layout &

Bunch

Clustered

Visualization

Output

Clustering

Graph

Utility

File

Tool

(e.g., dotty)

Reverse Engineering (Bunch)


The structure of a graphical editor

The Structure of a Graphical Editor

main

Event

boxParserClass

boxClass

error

stackOfAnyWithTop

boxScannerClass

Fonts

Globals

Mathlib

EdgeClass

ColorTable

stackOfAny

Lexer

hashedBoxes

hashGlobals

GenerateProlog

NodeOfAny

Reverse Engineering (Bunch)


The software clustering problem

The Software Clustering Problem

  • “Find a good partition of the Software Structure.”

  • A partition is the decomposition of a set of elements (i.e., all the nodes of the graph) into mutually disjoint clusters.

  • A goodpartition is a partition where:

    • highly interdependent nodes are grouped in the same clusters

    • independent nodes are assigned to separate clusters

Reverse Engineering (Bunch)


Not all partitions are created equal

N4

N1

N3

N2

N5

N6

Not all Partitions are Created Equal ...

N1

Good Partition!

N2

N4

N1

Bad Partition!

N2

N4

N3

N5

N3

N5

N6

N6

Reverse Engineering (Bunch)


Our assumption

Our Assumption

  • “Well designed software systems are organized into cohesive clusters that are loosely interconnected.”

Reverse Engineering (Bunch)


Problem there are too many partitions of the mdg

=

Ú

=

ì

1

if

k

1

k

n

=

S

í

+

,

n

k

S

kS

otherwise

î

-

-

-

1

,

1

1

,

n

k

n

k

Problem: There are too many partitions of the MDG…

The number of MDG partitions grows very quickly, as the number of modules in the system increases…

1 = 1

2 = 2

3 = 5

4 = 15

5 = 52

6 = 203

7 = 877

8 = 4140

9 = 21147

10 = 115975

11 = 678570

12 = 4213597

13 = 27644437

14 = 190899322

15 = 1382958545

16 = 10480142147

17 = 82864869804

18 = 682076806159

19 = 5832742205057

20 = 51724158235372

A 15 Module System is about the limit for performing Exhaustive Analysis

Reverse Engineering (Bunch)


Edge types

Edge Types

  • With respect to each cluster, there are two different kinds of edges:

    • edges (Intra-Edges) which are edges that start and end within the same cluster

    • edges (Inter-Edges) which are edges that start and end in different clusters

CLUSTER

Other

Clusters

a

b

c

Reverse Engineering (Bunch)


A partition of the structure of the graphical editor

A Partition of the Structure of the Graphical Editor

Main

Generate

Prolog

boxParserClass

Event

boxScanner

Class

Globals

boxClass

error

stackOfAnyWithTop

Lexer

stackOfAny

hashedBoxes

edgaClass

MathLib

colorTable

Fonts

NodeOfAny

hashGlobals

Reverse Engineering (Bunch)


Example the structure of the dot system

Example: The Structure of the dot System

Reverse Engineering (Bunch)


Example the structure of the dot system1

Example: The Structure of the dot System

  • The Module Dependency Graph (MDG)

  • We represent the structure of the system as a graph, called the MDG

    • The nodes are the modules/classes

    • The edges are the relations between the modules/classes

Reverse Engineering (Bunch)


Example the structure of the dot system2

Example: The Structure of the dot System

Dot is a reasonably small system, but its structure is complex and hard to understand…

Reverse Engineering (Bunch)


The dot system after clustering and filtering

The dot System after Clustering and Filtering

The clustered viewhighlights…

Identification of “special” modules(relations are hidden to improve clarity)

Subsystems that group common features

Modules From the Source Code

Relations Between the Modules

Reverse Engineering (Bunch)


The dot system after clustering and filtering1

The dot System after Clustering and Filtering

  • The Partitioned Module Dependency Graph (MDG)

  • The partitioned MDG contains clustersthat group related nodes from the MDG.

    • The clusters represent the subsystems

    • The subsystems highlight high-levelfeatures or services provided by the modules in the cluster.

Reverse Engineering (Bunch)


The dot system after clustering and filtering2

The dot System after Clustering and Filtering

The clustered view of dot (its partitioned MDG) is easier to understand than the unclustered view…

Reverse Engineering (Bunch)


Step 1 creating the mdg

Step 1: Creating the MDG

Example: The MDG for Apache’sRegular Expression class library

Source Code

void main(){printf(“hello”);}

Source Code

Analysis Tools

Acacia

Chava

  • The MDG can be generated automatically using source code analysis tools

  • Nodes are the modules/classes, edges represent source-code relations

  • Edge weights can be established in many ways, and different MDGscan be created depending on the types of relations considered

Reverse Engineering (Bunch)


Step 1 creating the mdg1

Step 1: Creating the MDG

Example: The MDG for Apache’sRegular Expression class library

Source Code

void main(){printf(“hello”);}

Automatic MDG Generation

We provide scripts on the SERG webpage to create MDGs automatically forsystems developed in C, C++, andJava.

The MDG eliminates details associatedwith particular programming languages

Source Code

Analysis Tools

Acacia

Chava

  • The MDG can be generated automatically using source code analysis tools

  • Nodes are the modules/classes, edges represent source-code relations

  • Edge weights can be established in many ways, and different MDGscan be created depending on the types of relations considered

Reverse Engineering (Bunch)


Why do we want to cluster

Why do we want to Cluster?

  • To group components of a system.

  • Grouping gives some semantic information

    • We find out which components are strongly related to each other.

    • We find out which components are not so strongly connected to each other.

    • This gives us an idea of the logical subsystem structure of the entire system.

Reverse Engineering (Bunch)


Clustering useful in practice

Clustering useful in practice?

  • Documentation may not exist.

  • As software grows, the documentation doesn’t always keep up to date.

  • There are constant changes in who uses and develops a system.

Reverse Engineering (Bunch)


Our approach to automatic clustering

Our Approach to Automatic Clustering

  • “Treat automatic clustering as a searching problem”

    • Maximize an objective function that formally quantifies the “quality” of an MDG partition.

    • We refer to the value of the objective function as the modularization quality (MQ)

MQ is a Measurement and not a Metric

Reverse Engineering (Bunch)


Why automatic clustering

Why automatic clustering?

  • Experts are not always available.

  • Gives the previously mentioned benefits without costing much.

Reverse Engineering (Bunch)


How does clustering help

How does clustering help?

  • Helps create models of software in the absence of experts.

    • Gives the new developers a ‘road map’ of the software’s structure.

    • Helps developers understand legacy systems.

    • Lets developers compare documented structure with some ‘inherent’ structure of the system.

Reverse Engineering (Bunch)


Module dependency graph

Module Dependency Graph

  • Represents the dependencies between the components of a system.

  • Nodes are components:

    • Classes, modules, files, packages, functions…

  • Edges are relationships between components:

    • Inherit, call, instantiate, etc…

Reverse Engineering (Bunch)


Software clustering problem

Software Clustering Problem

  • For a given mdg, and a function that determines the quality of a partitioning of that mdg, find the best partition.

Reverse Engineering (Bunch)


Possible answer

Possible Answer?

  • We could look at every possible partition…

    • Exhaustive search always gives the best answer.

    • The number of partitions of a set are equal to the recurrence given by Sterling:

      S(n, k) =1 if k = 1 or k = n

      S(n – 1, k – 1) + k S(n – 1, k)

Reverse Engineering (Bunch)


Exhaustive search

Exhaustive Search…

  • If Exhaustive search is optimal, then why not use it?

    • The recurrence to the Sterling recurrence grows very rapidly.

    • The search for partitioning for n nodes must go through partitions for all k in S(n, k). Thus the size of the search space is:

       S(n, k), for k = 1..n … enormous

Reverse Engineering (Bunch)


Other answers

Other Answers

  • If we don’t use Exhaustive search, then we have to use a sub-optimal strategy.

  • Just because a strategy is sub-optimal, though, doesn’t mean it won’t give good results.

Reverse Engineering (Bunch)


Software clustering with search algorithms

SEARCH SPACESet of All MDG Partitions

Software ClusteringSearch Algorithms

Source Code

void main(){printf(“hello”);}

bP = null;

while(searching()){p = selectNext(); if(p.isBetter(bP))

bP = p;}

return bP;

M1

M6

M3

M2

M8

M7

Source Code

Analysis Tools

M4

M5

Acacia

Chava

M6

M1

M3

M8

M7

MDG

M2

“GOOD” MDG Partition

M1

M3

M6

M4

M5

M1

M3

M6

M2

M7

M8

Total = 4140 Partitions

M2

M4

M5

M7

M8

M4

M5

Software Clustering with Search Algorithms

Reverse Engineering (Bunch)


Software clustering with search algorithms1

SEARCH SPACESet of All MDG Partitions

Software ClusteringSearch Algorithms

Source Code

void main(){printf(“hello”);}

bP = null;

while(searching()){p = selectNext(); if(p.isBetter(bP))

bP = p;}

return bP;

M1

M6

M3

M2

M8

M7

Source Code

Analysis Tools

M4

M5

Acacia

Chava

M6

M1

M3

M8

M7

MDG File

M2

“GOOD” MDG Partition

M1

M3

M6

M4

M5

M1

M3

M6

M2

M7

M8

Total = 4140 Partitions

M2

M4

M5

M7

M8

M4

M5

Software Clustering with Search Algorithms

  • Search Algorithm Requirements

  • Must be able to compare one partition to another objectively.

  • We define the Modularization Quality(MQ) measurement to meet this goal.

    • Given partitions P1 & P2, MQ(P1) > MQ(P2) means that P1 “is better than” P2

Reverse Engineering (Bunch)


Hill climbing

Hill-Climbing

  • Hill-Climbing refers to how to travel along the contour of the solution space.

  • The idea is to find a starting position, and then move to a better state.

    • What if there is more than one state that we could move to?

    • What happens when we find a local maximum?

    • Can we do anything about getting a bad start?

Reverse Engineering (Bunch)


Bunch hill climbing clustering algorithm

A neighborpartition iscreated byaltering thecurrentpartition slightly.

Neighbor

Partition

Generate

Next

Neighbor

Measure

MQ

Current

Partition

Measure MQ

New Best

Neighboring Partition

Compare to Best

Neighboring Partition

Better

Better?

Best Neighboring Partition for Iteration

Convergence

Best Neighboring Partition

Bunch Hill Climbing Clustering Algorithm

Generate a Random Decomposition of MDG

Iteration Step

Reverse Engineering (Bunch)


Bunch hill climbing clustering algorithm1

A neighborpartition iscreated byaltering thecurrentpartition slightly.

Neighbor

Partition

Generate

Next

Neighbor

Measure

MQ

Current

Partition

Measure MQ

New Best

Neighboring Partition

Compare to Best

Neighboring Partition

Better

Better?

Best Neighboring Partition for Iteration

Convergence

Best Neighboring Partition

Bunch Hill Climbing Clustering Algorithm

Generate a Random Decomposition of MDG

Iteration Step

  • Hill-Climbing Algorithm Features

  • We have implemented a family ofhill-climbing algorithms

  • Control the “steepness” of the climb

  • Simulated Annealing

  • Population Size

Reverse Engineering (Bunch)


Genetic algorithms

Genetic Algorithms

  • Genetic Algorithms take a problem, and model solution finding after nature.

    • Selection

    • Crossovers

    • Mutations

  • Genetic Algorithms tend not to get stuck at local maxima.

Reverse Engineering (Bunch)


Bunch genetic clustering algorithm ga

RANDOM

SELECTION

FavorPartitionswith LargerMQ Valuesfor CrossoverOperation

Next

Population

Current

Population

P1

P1

P1

P1

MutationOperation

P2

P2

P2

P2

P2

P3

P3

Mutate(Alter)a Small

Number ofPartitions

RANDOM

SELECTION

P3

P3

Pn

Pn

Pn

Pn

Next Generation

All Generations Processed

Best Partition from Final Population

Bunch Genetic Clustering Algorithm (GA)

Generate a Starting Population from the MDG

Iteration Step

CrossoverOperation

Reverse Engineering (Bunch)


Tools for clustering

Tools for Clustering

  • Bunch – Automatic Clustering

    • Performs automatic clustering.

    • Automatically produced results are often suboptimal.

    • Allows user-directed clustering:

      • Users can help the search for a good clustering with their knowledge.

Reverse Engineering (Bunch)


Using bunch

Using Bunch…

  • Bunch is a tool written in Java that will work on any machine supporting the JVM.

  • The first step in using Bunch will be finding out how to make those mdg files mentioned earlier.

Reverse Engineering (Bunch)


Creating mdg files

Creating mdg files…

  • To create an mdg, we use the mdg script:

    • Used to create mdg files for Java, C, and C++.

    • Can view various relations (extends, implements, method-variable, method-method).

    • Can also give weights to each relation, along with providing its type.

Reverse Engineering (Bunch)


Syntax

Syntax

mdg –(java|c|c++) –(eivmA)+ -(wltA)*

  • The first set of parameters represents the language to use.

  • The second set of parameters describes the types of relationship to display.

  • The last set determines what extra information to produce.

Reverse Engineering (Bunch)


Mdg results

MDG Results

  • The mdg script produces a series of pairs that represent directed edges.

  • These edges can have extra information along with them, such as weight and type, if desired.

Reverse Engineering (Bunch)


Bunch features

Bunch Features

  • Bunch is a tool to perform automatic clustering of module dependency graphs.

    • Has various algorithms for finding a good solution.

    • Allows users to assist in automatic clustering.

    • Provides methods to exclude omnipresent libraries and consumers in clustering (provides a more succinct clustering).

Reverse Engineering (Bunch)


The bunch algorithms

The Bunch Algorithms

  • Provides use for exhaustive search.

  • Has hill-climbing algorithms

    • Steepest Ascent (greedy)

    • Nearest Ascent

  • A genetic algorithm is also available.

Reverse Engineering (Bunch)


Hill climbing algorithms

Hill-Climbing Algorithms

  • Hill-Climbing algorithms get their names because they ‘climb’ to the top of ‘the hills’ of a solution space.

    • Given the current solution, and some transition function, find some better solution.

    • Sub-optimal: Can get caught on one of ‘the smaller hills’ a.k.a. local maximum.

Reverse Engineering (Bunch)


Steepest ascent

Steepest Ascent

  • Steepest Ascent is a ‘greedy’ algorithm.

    • Greedy Algorithms take the most of what they want whenever possible.

    • Steepest Ascent computes all the states it can transition to, and takes the best one.

    • Greedy algorithms aren’t always optimal overall, even though they take the optimal step at each state.

Reverse Engineering (Bunch)


Nearest ascent

Nearest Ascent

  • Nearest Ascent is a non-greedy hill-climbing algorithm

    • Does not compute the set of all possible states to transition to.

    • Computes states until a better one is found.

    • Much faster in practice than Steepest Ascent.

Reverse Engineering (Bunch)


Helping hill climbers

Helping Hill-Climbers

  • Hill-Climbing algorithms are inherently sub-optimal.

    • They still tend to give good results, though.

    • Even with good results, there are ways of improving the quality of the results, without that much effort (relatively).

Reverse Engineering (Bunch)


Bunch clustering

Bunch Clustering

  • Bunch, as it may already seem, views clustering as an optimization problem.

  • The bunch objective function trades off between two things:

    • Cohesiveness of clusters (how intra-related is a cluster?)

    • Coupling of clusters (how inter-related are clusters?)

Reverse Engineering (Bunch)


Random restart

Random Restart

  • One way to help Hill-Climbing algorithms is to restart the algorithm at some random state after finding a local maximum.

    • The idea is to get various local maxima.

    • After finding a set of various solutions, return the best one.

    • The cost isn’t really that much, in comparison to the cost in terms of system size.

Reverse Engineering (Bunch)


Simulated annealing

Simulated Annealing

  • Simulated Annealing is another technique used to assist Hill-Climbing algorithms.

    • Occasionally take a transition that is not an increase in quality.

    • The hope is that this will lead to climbing a better hill.

Reverse Engineering (Bunch)


Genetic algorithms1

Genetic Algorithms

  • A totally different approach is to use Genetic Algorithms.

    • Genetic Algorithms tend to converge to a solution quickly when the solution space is small relative to the search space.

    • Genetic Algorithms tend to provide good results.

    • Genetic Algorithms don’t get stuck at local maxima.

Reverse Engineering (Bunch)


Genetic algorithms2

Genetic Algorithms…

  • Genetic Algorithms work by representing each state as a string, and then doing ‘genetic manipulation’ on a ‘generation’ in ways analogous to nature.

    • Crossing Over

    • Mutations

    • Selection

Reverse Engineering (Bunch)


Measuring mq step 1 the cluster factor

Measuring MQ – Step 1:The Cluster Factor

The Cluster Factor for cluster i, CFi, is:

CF increases as thecluster’s cohesivenessincreases

Subsystem 1

Subsystem 2

Subsystem 3

M1

M6

M1

M4

M6

M3

M7

M3

M7

M2

M8

M2

M5

M8

M4

M5

CF1 = 4/6

CF2 = 2/5

CF3 = 6/7

Reverse Engineering (Bunch)


Modularization quality mq

Modularization Quality (MQ):

  • Modularization Quality (MQ) is a measurement of the “quality” of a particular MDG partition.

krepresents the number of clusters in thecurrent partition of the MDG.

Reverse Engineering (Bunch)


Modularization quality mq1

Modularization Quality (MQ):

  • Modularization Quality (MQ) is a measurement of the “quality” of a particular MDG partition.

  • MQ

  • We have implemented a family of MQfunctions.

  • MQ should support MDGs with edge weights

  • Faster than older MQ (Basic MQ)

    • TurboMQ »O(|V|)

    • ITurboMQ »O(1)

  • ITurboMQ incrementally updates, instead of recalculates, the CFi

krepresents the number of clusters in thecurrent partition of the MDG.

Reverse Engineering (Bunch)


The clustering problem algorithm objectives

The Clustering Problem:Algorithm Objectives

“Find a good partition of the MDG.”

  • A partition is the decomposition of a set of elements (i.e., all the nodes of the graph) into mutually disjoint clusters.

  • A goodpartition is a partition where:

    • highly interdependent nodes are grouped in the same clusters

    • independent nodes are assigned to separate clusters

  • The better the partition the higher the MQ

Reverse Engineering (Bunch)


The bunch tool automatic clustering

The Bunch Tool:Automatic Clustering

Other Toolsand Utilities

Clustering

Services

MDG

ClusteringAlgorithm

ClusteringAlgorithm

Options

Output

Format

START

Reverse Engineering (Bunch)


Starting bunch

Starting Bunch

  • Now that we know some of the stuff that Bunch does ‘under the hood,’ let’s start to use it.

    • Bunch is a Java program held in a jar file. To run it, type the following:

      java –classpath Bunch.jar bunch.Bunch

Reverse Engineering (Bunch)


Basic controls

Basic Controls

Reverse Engineering (Bunch)


Basic controls description

Basic Controls Description

  • Input Graph File: The mdg to be clustered.

  • Clustering Method: Which algorithm to use? (Hill-Climbing, Genetic, etc).

  • Output Cluster File: Where to put the results?

  • Output File Format: Dotty? Text?

Reverse Engineering (Bunch)


Clustering actions

Clustering Actions

  • Agglomerative Clustering: Bunch will cluster until the system is together in one large cluster.

  • User-Driven Clustering: Here, the user clicks ‘Generate Next Level’ to proceed with each level of clustering.

Reverse Engineering (Bunch)


Clustering options

Clustering Options

Reverse Engineering (Bunch)


Clustering options description

Clustering Options Description

  • Clustering Algorithm: Which ‘clustering’ algorithm to use.

  • Graph File Delimiters: Specify delimiting characters in MDG files.

  • Limiting Runtime: Used to limit the amount of time Bunch will run.

  • Agglomerative Output Options: What type of output for agglomerative clustering.

  • Generate Tree Format: Show clustering hierarchy.

Reverse Engineering (Bunch)


Libraries

Libraries

Reverse Engineering (Bunch)


Libraries1

Libraries

  • Some Modules are just libraries used by the system.

    • These don’t add any semantics to the system.

    • While being a part of what makes the system, they are just small tools that the designers decided to use.

Reverse Engineering (Bunch)


Omni present modules

Omni-Present Modules

Reverse Engineering (Bunch)


Omni present modules1

Omni-Present Modules

  • Some systems have internal modules that are use across a large portion of the system.

    • Including such modules in a decomposition is pointless.

    • Bunch gives the option of removing such modules from the decomposition.

Reverse Engineering (Bunch)


User directed clustering

User-Directed Clustering

Reverse Engineering (Bunch)


User directed clustering1

User-Directed Clustering

  • Sometimes, there will exist a priori knowledge about a system.

    • Such knowledge, when used as constraints on a decomposition, makes it a better decomposition.

    • Bunch allows pre-cluster grouping of modules.

Reverse Engineering (Bunch)


Results

Results…

Reverse Engineering (Bunch)


System evolution

System Evolution

  • Systems change constantly.

    • New developers add to chaos in the maintenance.

    • New features do so, too.

Reverse Engineering (Bunch)


Out with the old

Out with the old?

  • Even though systems change, out of date decompositions don’t have to be thrown away.

    • There still may be some relations that the decomposition represents.

    • An old decomposition may be better than some random start state for clustering a system.

Reverse Engineering (Bunch)


Orphan adoption

Orphan Adoption

  • When a change is applied to one module of a system, the amount of change is relatively small.

    • We can take the module that changed, and see how it fits in with every cluster.

    • We can also see how good of a system we get when it’s alone.

    • Taking the best solution out of these solutions is how Bunch handles this.

Reverse Engineering (Bunch)


Hierarchical clustering 1 tree view

Hierarchical Clustering (1):Tree View

1.

4.

2. Default

3.

Reverse Engineering (Bunch)


Hierarchical clustering 2 standard view

Hierarchical Clustering (2):Standard View

1.

4.

2. Default

3.

Reverse Engineering (Bunch)


Distributed clustering

Distributed Clustering

Distributed clustering allows large MDGs to beclustered faster…

Bunch User Interface(BUI)

Bunch ClusteringService(BCS)

Neighboring

Servers(NS)

Reverse Engineering (Bunch)


Distributed clustering1

Distributed Clustering

Distributed clustering allows large MDGs to beclustered faster…

  • Distributed Clustering Features

  • Multiple Clustering Algorithms

  • Heterogeneous Platforms

  • Adaptive Load Balancing

Bunch User Interface(BUI)

Bunch ClusteringService(BCS)

Neighboring

Servers(NS)

Reverse Engineering (Bunch)


Good clusterings

‘Good’ Clusterings

  • Even with all these features, only the Exhaustive Search gives us the optimal clustering.

  • How do we know how good of a clustering we have?

Reverse Engineering (Bunch)


  • Login