1 / 19

Evaluating Software Clustering Algorithms

Evaluating Software Clustering Algorithms. The problem. How do we know that a particular decomposition of a software system is good? What does ‘good’ mean anyway? We can compare against a Mental model Benchmark standard Can be done either manually or automatically. Manual evaluation.

Download Presentation

Evaluating Software Clustering Algorithms

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. EvaluatingSoftware Clustering Algorithms

  2. The problem • How do we know that a particular decomposition of a software system is good? • What does ‘good’ mean anyway? • We can compare against a • Mental model • Benchmark standard • Can be done either manually or automatically COSC6431

  3. Manual evaluation • Have experts evaluate automatic decompositions • Very time-consuming, impractical • Also quite subjective • Need an automatic, objective way of doing it COSC6431

  4. Automatic evaluation • Usually measures the similarity of an automatic decomposition A to an authoritative decomposition M (prepared manually) • Major drawback: Assumes there exists one ‘correct’ decomposition • Other evaluation approaches are possible, such as measuring the stability of an SCA COSC6431

  5. Precision/Recall • Standard Information Retrieval measures • Were applied in a software clustering context by Anquetil (1999) • Definitions: • Intra pair: A pair of software entities in the same cluster • Inter pair: A pair of entities in different clusters COSC6431

  6. Precision/Recall • Precision: Percentage of intra pairs in A which are also intra in M • Recall: Percentage of intra pairs in M found also in A • A good algorithm should exhibit high values in both measures COSC6431

  7. Problems with P/R • Sensitive to the size and number of clusters • Differences are exaggerated if you have small/many clusters • Two values makes comparison harder • Requires an authoritative partition • In judging the similarity of two arbitrary partitions, the two measures are interchangeable COSC6431

  8. Koschke – Eisenbarth measure • Loosely based on Precision/Recall • Attempts to be less strict • Definitions: • GOOD match: Two clusters (one in A, one in M) with both precision and recall larger than a threshold p (typical value 70%) • OK match: Two clusters with only one of the measures larger than p COSC6431

  9. Koschke – Eisenbarth measure • Clusters in A not in a match are called false positive • Clusters in M not in a match are called true negative • A formula computes the overall similarity as the average overlap between all pairs of clusters in a match • Works rather well when A and M are fairly close, but suffers from similar problems as P/R. COSC6431

  10. MoJo distance • The distance between two different partitions of the same software system is defined as the minimum number of Move and Join operations to transform one to the other • Move: Remove an object from a cluster and put it in a different cluster • Join: Merge two clusters into one • Split: Has to be simulated by Move operations COSC6431

  11. Why only Move and Join? • Two clusters can be joined in only one way. One cluster can be split in two in an exponential number of ways • Joining two clusters only means that we performed more detailed clustering than required • Splitting is effectively assigned a weight equal to the cardinality of the smaller of the two resulting clusters COSC6431

  12. Computing MoJo distance • Give each object in partition A a tag j if it belongs in cluster Mj in partition M • For each cluster in A, assign it to group Gk if it contains more objects with tag k than any other tag. • To resolve ties, use the maximum bipartite matching method to maximize the number of groups. • Join clusters within same group. Then move all objects with tag k to the clusters in group Gk. COSC6431

  13. MoJo Distance Calculation • MoJo Distance is M+J, where M is the number of Moves, and J is the number of Joins • The moves for each cluster is , and the total moves is • Assume the number of non-empty groups is g, from G1 to Gg. The joins of each group is|Gk|-1. Then the number of Joins is COSC6431

  14. Outline of Proof • Move & Join operations are commutative • The algorithm uses the minimum number of Moves • Any algorithm that also uses the minimum number of Moves can not beat us • For any algorithm that does not use the minimum number of Moves, there exists another one which is better or equal and uses the minimum number of Moves COSC6431

  15. MoJo Quality Metric • Also assumes the existence of a ‘correct’ authoritative partition • MoJo Distance can be used to measure the distance between two arbitrary partitions COSC6431

  16. MeCl and EdgeSim • Other measures do not consider edges • Edges might convey important information • Definitions: • Intra edges: Edges within a cluster • Inter edges: Edges between clusters • Measure the difference between A and M in terms of the types of edges COSC6431

  17. MeCl and EdgeSim • EdgeSim: • Rewards clustering algorithms for preserving the edge types • Penalizes clustering algorithms for changing the edge types • MeCl: • Rewards the clustering algorithm for creating cohesive “subclusters” COSC6431

  18. Problems with MeCl and EdgeSim • It is possible to construct different partitions where all edges are of the same type (any clusters that have no connection between them can be merged without affecting the edge type) • As usual, the truth is somewhere in the middle… COSC6431

  19. EdgeMoJo • Penalizes for misplacement of nodes and edges • Moving objects carries extra weight if many edges are moved as well • Evaluation of EdgeMoJo still in progress COSC6431

More Related