
10/05/2002

The Search Landscape of Graph Partitioning Problems using Coupling and Cohesion as the Clustering Criteria. Brian S. Mitchell & Spiros Mancoridis {bmitchel,smancori}@mcs.drexel.edu http://www.mcs.drexel.edu/~{bmitchel,smancori} Department of Computer Science


Presentation Transcript


  1. The Search Landscape of Graph Partitioning Problems using Coupling and Cohesion as the Clustering Criteria. Brian S. Mitchell & Spiros Mancoridis, {bmitchel,smancori}@mcs.drexel.edu, http://www.mcs.drexel.edu/~{bmitchel,smancori}. Department of Computer Science, Software Engineering Research Group (http://serg.mcs.drexel.edu), Drexel University, Philadelphia, PA, USA. 10/05/2002

  2. Software Clustering with Bunch. [Architecture diagram: source code (e.g., void main(){printf("hello");}) is fed to source code analysis tools such as Acacia and Chava, which produce an MDG file; the Bunch clustering tool (Bunch GUI, clustering algorithms, programming API) partitions the MDG; the partitioned MDG file is displayed by a visualization tool.]

  3. Software Clustering as a Search Problem. [Diagram: source code analysis tools (Acacia, Chava) extract the MDG from source code; a software clustering search algorithm explores the search space, the set of all MDG partitions (4140 partitions in total for the 8-module example), looking for a "GOOD" MDG partition.] The search loop, in pseudocode:

      bP = null;
      while (searching()) {
          p = selectNext();
          if (p.isBetter(bP))
              bP = p;
      }
      return bP;

  4. The Search Space is Enormous. The number of MDG partitions grows very quickly as the number of modules in the system increases. The number of ways to partition n modules into k non-empty clusters satisfies the recurrence

      S(n,k) = 1                              if k = 1 or k = n
      S(n,k) = S(n-1,k-1) + k * S(n-1,k)      otherwise

  and the total number of partitions of an n-module system is the sum of S(n,k) over k = 1..n:

  1 = 1, 2 = 2, 3 = 5, 4 = 15, 5 = 52, 6 = 203, 7 = 877, 8 = 4140, 9 = 21147, 10 = 115975, 11 = 678570, 12 = 4213597, 13 = 27644437, 14 = 190899322, 15 = 1382958545, 16 = 10480142147, 17 = 82864869804, 18 = 682076806159, 19 = 5832742205057, 20 = 51724158235372

  A 15-module system is about the limit for performing exhaustive analysis.
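The recurrence above can be checked directly. A minimal Python sketch (the function names are mine, not from the talk) that computes S(n,k) and sums over k to count all MDG partitions:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def S(n, k):
    """Number of ways to partition n modules into exactly k non-empty
    clusters (Stirling numbers of the second kind), via the recurrence
    on the slide."""
    if k == 1 or k == n:
        return 1
    return S(n - 1, k - 1) + k * S(n - 1, k)

def partitions(n):
    """Total number of MDG partitions of an n-module system
    (the Bell number): sum of S(n,k) over all cluster counts k."""
    return sum(S(n, k) for k in range(1, n + 1))

print(partitions(8))   # 4140, matching the slide's 8-module example
print(partitions(15))  # 1382958545
print(partitions(20))  # 51724158235372
```

Memoizing S with lru_cache keeps the computation polynomial; a naive recursive evaluation would be exponential in n.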

  5. Our Assumption… “Well designed software systems are organized into cohesive clusters that are loosely interconnected.” • We designed a measurement called MQ that embodies our assumption • The MQ measurement balances cohesion and coupling • We apply MQ to partitions of the MDG

  6. Not all Partitions of the MDG are Good Solutions. [Diagram: an MDG over modules M1–M6, shown partitioned two ways: a good partition and a bad partition of the same graph.] MQ(Good Partition) > MQ(Bad Partition)

  7. The Software Clustering Problem: Algorithm Objectives. "Find a good partition of the MDG." • A partition is the decomposition of a set of elements (i.e., all the nodes of the graph) into mutually disjoint clusters. • A good partition is a partition where: highly interdependent nodes are grouped in the same clusters, and independent nodes are assigned to separate clusters. • The better the partition, the higher the MQ.
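As a concrete illustration of "the better the partition, the higher the MQ", the sketch below implements one published formulation of Bunch's measurement (the TurboMQ variant, assuming unweighted edges). The six-module edge list is hypothetical, standing in for the MDG pictured on the previous slide:

```python
from collections import defaultdict

def mq(edges, assignment):
    """MQ of a partition: sum over clusters of the Cluster Factor
    CF_i = 2*mu_i / (2*mu_i + eps_i), where mu_i counts intra-cluster
    edges and eps_i counts edges entering or leaving cluster i.
    edges: iterable of (src, dst); assignment: module -> cluster id."""
    mu = defaultdict(int)   # internal edge count per cluster
    eps = defaultdict(int)  # external edge count per cluster
    for src, dst in edges:
        ci, cj = assignment[src], assignment[dst]
        if ci == cj:
            mu[ci] += 1
        else:
            eps[ci] += 1
            eps[cj] += 1
    return sum(2 * mu[c] / (2 * mu[c] + eps[c])
               for c in set(assignment.values()) if mu[c] or eps[c])

# Hypothetical MDG: two well-connected triples joined by a single edge.
edges = [("M1", "M2"), ("M2", "M3"), ("M1", "M3"),
         ("M4", "M5"), ("M5", "M6"), ("M4", "M6"),
         ("M3", "M4")]
good = {"M1": 0, "M2": 0, "M3": 0, "M4": 1, "M5": 1, "M6": 1}
bad  = {"M1": 0, "M4": 0, "M2": 1, "M5": 1, "M3": 2, "M6": 2}
print(mq(edges, good) > mq(edges, bad))  # True
```

The good partition keeps each triple together (high cohesion, one coupling edge), while the bad partition splits every triple across clusters, so all of its edges count as coupling and its MQ drops to zero.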

  8. Bunch Hill Climbing Clustering Algorithm. [Flowchart: generate a random decomposition of the MDG; in each iteration step, generate the next neighbor partition, measure its MQ, and compare it to the best neighboring partition so far, keeping the better of the two; the best neighboring partition for the iteration is carried forward until convergence.] A neighbor partition is created by altering the current partition slightly.

  9. Bunch Hill Climbing Clustering Algorithm (continued). [Same flowchart as the previous slide.] Other things of interest: we have implemented a family of hill-climbing algorithms, and we also implemented an exhaustive algorithm and a genetic algorithm.
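The flowchart's loop can be written out roughly as follows. This is a simplified next-ascent sketch, assuming an unweighted MDG and a TurboMQ-style fitness; the function names and the six-module example graph are illustrative, not Bunch's actual code:

```python
import random
from collections import defaultdict

def mq(edges, part):
    """Sum of cluster factors 2*mu/(2*mu + eps); part[m] is module m's cluster."""
    mu, eps = defaultdict(int), defaultdict(int)
    for a, b in edges:
        if part[a] == part[b]:
            mu[part[a]] += 1
        else:
            eps[part[a]] += 1
            eps[part[b]] += 1
    return sum(2 * mu[c] / (2 * mu[c] + eps[c])
               for c in set(part) if mu[c] or eps[c])

def neighbors(part):
    """Neighbor partitions: move one module into a different existing
    cluster, or into a brand-new singleton cluster (slightly altering
    the current partition)."""
    k = max(part) + 1
    for m in range(len(part)):
        for c in range(k + 1):
            if part[m] != c:
                nb = list(part)
                nb[m] = c
                yield nb

def hill_climb(edges, n, seed=0):
    rng = random.Random(seed)
    best = [rng.randrange(n) for _ in range(n)]  # random decomposition of the MDG
    improved = True
    while improved:                              # iterate until convergence
        improved = False
        for nb in neighbors(best):
            if mq(edges, nb) > mq(edges, best):  # better neighbor found
                best, improved = nb, True
                break                            # next-ascent: take first improvement
    return best

# Hypothetical 6-module MDG: two triangles joined by one edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
part = hill_climb(edges, 6)
print(part)
```

The loop terminates because MQ strictly increases on every accepted move and the number of distinct partitions is finite; like any hill climber, it stops at a local optimum, which is one reason Bunch is run many times from different random starting points.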

  10. Hierarchical Clustering (1): Nested View. [Screenshots of clustering levels 1–4 and the default level.]

  11. Hierarchical Clustering (2): Consolidated View. [Screenshots of clustering levels 1–4 and the default level.]

  12. Hierarchical Clustering (3): Tree View. [Screenshot of the tree view.]

  13. Hierarchical Clustering (3): Tree View (continued). • Observations • The number of levels for a given system's clustering hierarchy is bounded by O(log2 N), because Bunch places at least 2 nodes in each cluster.
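The O(log2 N) bound follows because, when every cluster holds at least 2 nodes, each level of the hierarchy has at most half as many entities as the level below it. A small sketch of the worst case (the helper name is mine, not Bunch code):

```python
def max_hierarchy_levels(n_modules):
    """Upper bound on hierarchy depth: each level clusters the previous
    level's entities into clusters of at least 2 members, so the entity
    count at least halves per level until one cluster remains."""
    levels, entities = 0, n_modules
    while entities > 1:
        entities //= 2   # worst case: every cluster has exactly 2 members
        levels += 1
    return levels

print(max_hierarchy_levels(100))   # 6 levels suffice for a 100-module system
print(max_hierarchy_levels(1000))  # 9 levels for a 1000-module system
```

This is consistent with the later observation that real systems converge to a median level of at most 2: in practice clusters hold far more than 2 nodes, so the hierarchy collapses much faster than this worst-case bound.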

  14. Evaluating The Software Clustering Results • Over the past few years we have spent a lot of time evaluating Bunch’s software clustering results • Empirically • Semi-formally • Measuring Similarity

  15. What We Know • Given a particular MDG, the results produced by Bunch converge to a family of related solutions • The search space is large, and the probability of finding a good solution by random sampling is infinitesimal

  16. Software Clustering using Graph Partitioning Techniques • Running Bunch multiple times produces a family of related clustering results • Bunch starts with a random partition of the MDG, and makes random moves to explore the search space

  17. Software Clustering using Graph Partitioning Techniques How related are these clustering results?

  18. Software Clustering using Graph Partitioning Techniques. Given that there are 27,644,437 distinct partitions of this MDG, there is a lot of agreement…

  19. Software Clustering using Graph Partitioning Techniques: Why Some Modules Don't Agree… [Diagram annotating the disagreeing modules: library modules, isomorphism, omnipresent module influences.]

  20. Special Modules • Isomorphic – Modules that are connected to multiple clusters with equal strength • Library – All edges fan-in • Driver – All edges fan-out • Omnipresent – Modules that are strongly connected to many other modules in the system
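The special-module categories above can be approximated from fan-in and fan-out alone. The classifier below is a hypothetical sketch (the function, the omnipresence threshold, and the example graph are mine, not from the talk; in particular it checks the pure fan-in/fan-out cases before omnipresence, and it cannot detect isomorphic modules, which require comparing connection strengths per cluster):

```python
from collections import defaultdict

def classify_modules(edges, omnipresent_ratio=0.75):
    """Label each module as library (all edges fan-in), driver (all
    edges fan-out), omnipresent (connected to a large fraction of the
    system), or ordinary. The threshold is an assumed parameter."""
    fan_in, fan_out, modules = defaultdict(int), defaultdict(int), set()
    for src, dst in edges:
        fan_out[src] += 1
        fan_in[dst] += 1
        modules.update((src, dst))
    labels = {}
    for m in modules:
        degree = fan_in[m] + fan_out[m]
        if fan_out[m] == 0:
            labels[m] = "library"       # all edges fan-in
        elif fan_in[m] == 0:
            labels[m] = "driver"        # all edges fan-out
        elif degree >= omnipresent_ratio * (len(modules) - 1):
            labels[m] = "omnipresent"   # strongly connected to many modules
        else:
            labels[m] = "ordinary"
    return labels

# Hypothetical MDG: main drives everything, util is used by everything.
edges = [("main", "util"), ("main", "parser"), ("parser", "util"),
         ("main", "io"), ("io", "util")]
print(classify_modules(edges))
```

Separating such modules before (or after) clustering matters because, as the slides note, they are exactly the modules whose cluster assignment disagrees across runs.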

  21. Clustering a System Many Times (1)… [Plots comparing random clustering results against Bunch for RCS and Dot.]

  22. Clustering a System Many Times (2)… [Plots comparing random clustering results against Bunch for Swing.]

  23. Clustering a System Many Times (2)… [Same random vs. Bunch plots for Swing as the previous slide.] • Observations • As the number of clusters increased in the random samples, MQ decreased • Bunch converged to a consistent "family" of solutions, no matter where the random starting point was generated • Some solutions were multi-modal • Random solutions were consistently worse than Bunch's solutions.

  24. Example – Detailed Results: Bunch System. [Chart showing a 23% / 77% split.] The search space has some inherent structure, as random clusters constrained to the area where Bunch converged did not produce better MQ values.

  25. Understanding the Search Space • There are characteristics of Bunch’s clustering algorithms that are interesting: • It seems unusual that the clustering algorithms produce consistent MQ values given the large search space • Other approaches [spectral methods] to solving the clustering problem using Bunch’s MQ have not produced better clustering results • The median clustering level is a good tradeoff between cluster size and number of clusters • Harman et al. examined using a target granularity [GECCO’02] to bias the desired cluster sizes

  26. Investigating the Search Space • Examined multiple systems of different sizes: • 15 open source systems developed in C, C++, or Java • 13 randomly generated graphs with different properties that we wanted to investigate. We clustered each MDG 500 times and examined the clustering data to gain some insight into the search space.

  27. Example: Median Clustering Level. [Cumulative MQ plots for swing and Kerberos v.5.]

  28. Example: Median Clustering Level. [MQ plots for telnetd and php.]

  29. Example: Median Clustering Level. [Plots (X axis: MQ value) for bash, mod_ssl, lynx, elm, mailx, ping_libc.]

  30. Example: Median Clustering Level – Random Bipartite Graphs. [Plots (X axis: MQ value) for bip-100-1, bip-100-2, bip-100-5, bip-100-25, bip-100-75.]

  31. Example: Median Clustering Level – Random Graphs. [Plots (X axis: MQ value) for rnd-100-1, rnd-100-2, rnd-100-5, rnd-100-25, rnd-100-75.]

  32. Example: Median Clustering Level – Random "Circle" Graphs. [Plots (X axis: MQ value) for circle-50, circle-100, circle-150.]

  33. MQ versus #Clusters. [Plots (X axis: #clusters, Y axis: MQ value) for krb5, swing, telnetd, php, bash, mod_ssl, ping_libc, elm, lynx, mailx.]

  34. MQ versus #Clusters. [Plots (X axis: #clusters, Y axis: MQ value) for bip-100-1, bip-100-5, bip-100-25, bip-100-75, rnd-100-1, rnd-100-5, rnd-100-25, rnd-100-75, cir-50, cir-100, cir-150.]

  35. Internal versus External Edges. [Plots (X axis: external edges, Y axis: internal edges) for krb5, swing, telnetd, php, bash, mod_ssl, ping_libc, elm, lynx, mailx.]

  36. Internal versus External Edges. [Plots (X axis: external edges, Y axis: internal edges) for bip-100-1, bip-100-5, bip-100-25, bip-100-75, rnd-100-1, rnd-100-5, rnd-100-25, rnd-100-75, cir-50, cir-100, cir-150.]

  37. Real Systems

  38. Random Systems

  39. Real Systems

  40. Random Systems

  41. What we Learned From Studying the Search Landscape • Not all modules are "equal" - Some modules: • Are connected to many other modules • Are connected to few other modules • Have a large fan-in • Have a large fan-out • Are uniformly connected to other system components • Are not uniformly connected to other system components. Some modules may have a more "natural" home than other modules with respect to their assigned cluster.

  42. What we Learned From Studying the Search Landscape • Bunch tends to converge to a consistent solution with respect to MQ • There is a very low probability of finding one of these partitions by random selection • The partitions found by Bunch are a very small subset of the overall search landscape • The degree of isomorphism in the clustering results was larger than expected

  43. What we Learned From Studying the Search Landscape • When examining the median level of the clustering hierarchy, we observed that all systems tend to converge to at most 2 levels • The systems that we studied range from under 100 modules to several thousand modules • The number of levels in the clustering hierarchy is bounded by O(log2 N) • We expect that studying systems with several hundred thousand modules would produce results where the median level converges to more than 2 levels • We observed this in very sparse graphs (e.g., rnd-100-1 and bip-100-1)

  44. Conclusions (1) • Understanding the search landscape is important • A single run of Bunch is helpful, but it does not highlight modules/classes that tend to drift between clusters • Analysis of many Bunch runs helps build a mental model of the search landscape

  45. Conclusions (2) • A best practice for program understanding • Cluster a system many times in order to understand the search landscape • Identify and separate omnipresent, library and supplier modules • Identify modules that tend to drift between many subsystems • Assign them to clusters manually, or influence the clustering algorithm by adjusting the edge weights • Bunch supports manual and semi-automatic clustering features to help with this type of analysis

  46. Questions • Special Thanks To: • AT&T Research • Sun Microsystems • DARPA • NSF • US Army • SEMINAL Group
