
Software Clustering Based on Information Loss Minimization



  1. Software Clustering Based on Information Loss Minimization
  Periklis Andritsos, University of Toronto
  Vassilios Tzerpos, York University
  The 10th Working Conference on Reverse Engineering

  2. The Software Clustering Problem
  • Input:
    • A set of software artifacts (files, classes)
    • Structural information, i.e. interdependencies between the artifacts (invocations, inheritance)
    • Non-structural information (timestamps, ownership)
  • Goal: Partition the artifacts into “meaningful” groups in order to help understand the software system at hand

  3. Example
  [diagram: program files depending on utility files — the utility files have almost the same dependencies and are used by the same program files]

  4. Open questions
  • Validity of clusters discovered based on high cohesion and low coupling
    • No guarantee that legacy software was developed in such a way
  • Discovering utility subsystems
    • Utility subsystems are low-cohesion / high-coupling
    • They commonly occur in manual decompositions
  • Utilizing non-structural information
    • What types of information have value?
    • LOC, timestamps, ownership, directory structure

  5. Our goals
  • Create decompositions that convey as much information as possible about the artifacts they contain
  • Discover utility subsystems as well as subsystems based on high cohesion and low coupling
  • Evaluate the usefulness of any combination of structural and non-structural information

  6. Information Theory Basics
  • Entropy H(A): measures the uncertainty in a random variable A
  • Conditional Entropy H(B|A): measures the uncertainty of a variable B, given a value for variable A
  • Mutual Information I(A;B): measures the dependence of two random variables A and B
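The three quantities above can be computed directly from a joint distribution over artifacts and features. A minimal sketch in Python (the distributions, variable names, and representation as a `{(a, b): prob}` dictionary are illustrative, not from the paper):

```python
import math

def entropy(dist):
    """Shannon entropy H (in bits) of a distribution given as {outcome: prob}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def mutual_information(joint):
    """I(A;B) = H(A) + H(B) - H(A,B) from a joint table {(a, b): prob}."""
    pa, pb = {}, {}
    for (a, b), p in joint.items():
        pa[a] = pa.get(a, 0.0) + p
        pb[b] = pb.get(b, 0.0) + p
    h_ab = -sum(p * math.log2(p) for p in joint.values() if p > 0)
    return entropy(pa) + entropy(pb) - h_ab

def conditional_entropy(joint):
    """H(B|A) = H(A,B) - H(A)."""
    pa = {}
    for (a, _), p in joint.items():
        pa[a] = pa.get(a, 0.0) + p
    h_ab = -sum(p * math.log2(p) for p in joint.values() if p > 0)
    return h_ab - entropy(pa)
```

For a perfectly dependent joint such as `{('a1','b1'): 0.5, ('a2','b2'): 0.5}`, knowing A determines B, so H(B|A) = 0 and I(A;B) = 1 bit.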

  7. Information Bottleneck (IB) Method
  • A: a random variable that ranges over the artifacts to be clustered
  • B: a random variable that ranges over the artifacts’ features
  • I(A;B): mutual information of A and B
  • Information Bottleneck Method [TPB’99]: compress A into a clustering Ck so that the information preserved about B is maximum (k = number of clusters)
  • Optimization criterion: minimize I(A;B) - I(Ck;B), or equivalently minimize H(B|Ck) - H(B|A)
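The optimization criterion can be evaluated for any candidate clustering by collapsing the artifact axis of the joint distribution and comparing mutual information before and after. A small sketch, with an illustrative joint table (not data from the paper):

```python
import math

def mi(joint):
    """Mutual information (bits) from a joint distribution {(x, y): prob}."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return sum(p * math.log2(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

def information_loss(joint, assign):
    """I(A;B) - I(C;B) for the clustering induced by the map `assign`."""
    collapsed = {}
    for (a, b), p in joint.items():
        key = (assign[a], b)
        collapsed[key] = collapsed.get(key, 0.0) + p
    return mi(joint) - mi(collapsed)
```

Merging two artifacts whose conditional feature distributions are identical loses no information about B, while collapsing everything into a single cluster loses all of I(A;B) — which is exactly why the criterion favors grouping artifacts with similar dependencies.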

  8. Information Bottleneck Method
  [diagram: artifacts a1…an (A) are compressed into clusters c1…ck (C) while preserving information about features b1…bm (B) — minimize the loss of mutual information, i.e. maximize I(C;B)]

  9. Agglomerative IB
  • Conceptualize the dependency graph as an n×m matrix (artifacts by features)
  • Compute an n×n matrix indicating the information loss we would incur if we joined any two artifacts into a cluster:

      A\B   f1    f2    f3    u1    u2
      f1    -     .10   .10   .17   .17
      f2    .10   -     .10   .17   .17
      f3    .17   .17   -     .17   .17
      u1    .17   .17   .17   -     0
      u2    .17   .17   .17   0     -

  • Merge the tuples with the minimum information loss
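The pairwise loss entries above can be computed as the weighted Jensen-Shannon divergence between the two artifacts' conditional feature distributions, scaled by their combined probability mass — a standard formulation of the agglomerative IB merge cost. A sketch, assuming each artifact is represented by its marginal probability and a `{feature: prob}` conditional:

```python
import math

def _h(dist):
    """Shannon entropy (bits) of {outcome: prob}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def merge_cost(p_i, cond_i, p_j, cond_j):
    """Information lost by merging artifacts i and j:
    (p_i + p_j) times the weighted Jensen-Shannon divergence
    between p(B|i) and p(B|j)."""
    w = p_i + p_j
    # mixture distribution of the merged cluster
    mix = {}
    for dist, weight in ((cond_i, p_i / w), (cond_j, p_j / w)):
        for b, p in dist.items():
            mix[b] = mix.get(b, 0.0) + weight * p
    jsd = _h(mix) - (p_i / w) * _h(cond_i) - (p_j / w) * _h(cond_j)
    return w * jsd
```

Artifacts with identical dependencies have zero merge cost, while artifacts with disjoint dependencies are maximally expensive to merge — which is how the algorithm ends up grouping files that "have almost the same dependencies".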

  10. Adding Non-Structural Data
  • If we have information about the developer and location of files, we express the artifacts to be clustered using a new matrix
  • Instead of B we use B’ to include the non-structural data
  • We can compute I(A;B’) and proceed as before
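One simple way to build the extended matrix for B’ is to append one-hot columns for the non-structural attribute values to each artifact's structural feature row. A sketch under that assumption (the attribute names are hypothetical):

```python
def extend_features(struct_rows, nonstruct):
    """Append one-hot non-structural columns (developer, directory, ...)
    to each artifact's structural feature row, forming the matrix for B'."""
    # stable, sorted vocabulary of all non-structural attribute values
    vocab = sorted({v for vals in nonstruct.values() for v in vals})
    return {art: row + [1 if v in nonstruct.get(art, ()) else 0 for v in vocab]
            for art, row in struct_rows.items()}
```

After this step the extended rows can be normalized into conditional distributions and fed to the same clustering machinery as the purely structural features.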

  11. ScaLable InforMation BOttleneck
  • AIB has quadratic complexity, since we need to compute an n×n distance matrix
  • LIMBO algorithm:
    • Produce summaries of the artifacts
    • Apply agglomerative clustering on the summaries
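The summarization step can be illustrated with a simplified one-pass sketch: keep a bounded number of summaries and fold each incoming artifact into whichever summary is cheapest to join. This is only the spirit of the approach — the actual LIMBO algorithm organizes its summaries more carefully — and the `cost` callback is an assumed placeholder for a merge-cost function such as the one on the previous slide:

```python
def summarize(artifacts, budget, cost):
    """One-pass summarization sketch: keep at most `budget` summaries;
    an artifact starts a new summary while space remains, otherwise it
    joins the summary it is cheapest to merge into."""
    summaries = []
    for a in artifacts:
        if len(summaries) < budget:
            summaries.append([a])
        else:
            min(summaries, key=lambda s: cost(s, a)).append(a)
    return summaries
```

Agglomerative clustering then runs on the (much smaller) set of summaries instead of the full n×n artifact matrix, which is what restores scalability.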

  12. Experimental Evaluation
  • Data Sets
    • TOBEY: 939 files / 250,000 LOC
    • LINUX: 955 files / 750,000 LOC
  • Clustering Algorithms
    • ACDC: pattern-based
    • BUNCH: adheres to high cohesion and low coupling (NAHC, SAHC)
  • Cluster Analysis Algorithms
    • Single linkage (SL)
    • Complete linkage (CL)
    • Weighted average linkage (WA)
    • Unweighted average linkage (UA)

  13. Experimental Evaluation
  • Compared the output of different algorithms using MoJo
  • MoJo measures the number of Move/Join operations needed to transform one clustering into another
  • The smaller the MoJo value of a particular clustering, the more effective the algorithm that produced it
  • We compute MoJo with respect to an authoritative decomposition
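The flavor of the metric can be shown with a greedy sketch: map each of our clusters to the authoritative group it overlaps most, count a Move for every misplaced artifact, then a Join for every pair of clusters mapped to the same group. Exact MoJo uses an optimal mapping, so this simplified version only yields an upper bound:

```python
from collections import Counter

def mojo_upper_bound(ours, authoritative):
    """Greedy upper bound on MoJo(ours -> authoritative).
    `ours` is a list of clusters (lists of artifacts);
    `authoritative` maps each artifact to its authoritative group."""
    moves, targets = 0, []
    for cluster in ours:
        tags = Counter(authoritative[a] for a in cluster)
        group, hits = tags.most_common(1)[0]   # best-overlapping group
        targets.append(group)
        moves += len(cluster) - hits           # artifacts that must Move
    joins = len(targets) - len(set(targets))   # clusters that must Join
    return moves + joins
```

A clustering identical to the authoritative decomposition scores 0; splitting one authoritative group in two costs a single Join.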

  14. Structural Feature Results
  [results chart: LIMBO found utility clusters]

  15. Non-Structural Feature Results
  • We considered all possible combinations of structural and non-structural features
  • Non-structural features were available only for Linux:
    • Developers (dev)
    • Directory (dir)
    • Lines of Code (loc)
    • Time of Last Update (time)
  • For each combination we report the number of clusters k when the MoJo value between k and k+1 differs by one

  16. Non-Structural Feature Results
  • 8 combinations outperform the structural results
  • “Dir” information produced better decompositions
  • “Dev” information has a positive effect
  • “Time” leads to worse clusterings
