
Software Clustering Based on Information Loss Minimization



  1. Software Clustering Based on Information Loss Minimization
  Periklis Andritsos, University of Toronto
  Vassilios Tzerpos, York University
  The 10th Working Conference on Reverse Engineering

  2. The Software Clustering Problem
  • Input:
    • A set of software artifacts (files, classes)
    • Structural information, i.e. interdependencies between the artifacts (invocations, inheritance)
    • Non-structural information (timestamps, ownership)
  • Goal: Partition the artifacts into “meaningful” groups in order to help understand the software system at hand

  3. Example
  [diagram: program files depending on utility files — the utility files have almost the same dependencies and are used by the same program files]

  4. Open questions
  • Validity of clusters discovered based on high cohesion and low coupling
    • No guarantee that legacy software was developed in such a way
  • Discovering utility subsystems
    • Utility subsystems are low-cohesion / high-coupling
    • They commonly occur in manual decompositions
  • Utilizing non-structural information
    • What types of information have value?
    • LOC, timestamps, ownership, directory structure

  5. Our goals
  • Create decompositions that convey as much information as possible about the artifacts they contain
  • Discover utility subsystems as well as subsystems based on high cohesion and low coupling
  • Evaluate the usefulness of any combination of structural and non-structural information

  6. Information Theory Basics
  • Entropy H(A): measures the uncertainty in a random variable A
  • Conditional Entropy H(B|A): measures the uncertainty of a variable B, given a value for variable A
  • Mutual Information I(A;B): measures the dependence of two random variables A and B
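The three quantities above can be computed directly from a joint distribution over artifacts and features. A minimal sketch in Python (the distributions, variable names, and representation as a `{(a, b): prob}` dictionary are illustrative, not from the paper):

```python
import math

def entropy(dist):
    """Shannon entropy H (in bits) of a distribution given as {outcome: prob}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def mutual_information(joint):
    """I(A;B) = H(A) + H(B) - H(A,B) from a joint table {(a, b): prob}."""
    pa, pb = {}, {}
    for (a, b), p in joint.items():
        pa[a] = pa.get(a, 0.0) + p
        pb[b] = pb.get(b, 0.0) + p
    h_ab = -sum(p * math.log2(p) for p in joint.values() if p > 0)
    return entropy(pa) + entropy(pb) - h_ab

def conditional_entropy(joint):
    """H(B|A) = H(A,B) - H(A)."""
    pa = {}
    for (a, _), p in joint.items():
        pa[a] = pa.get(a, 0.0) + p
    h_ab = -sum(p * math.log2(p) for p in joint.values() if p > 0)
    return h_ab - entropy(pa)
```

For a perfectly dependent joint such as `{('a1','b1'): 0.5, ('a2','b2'): 0.5}`, knowing A determines B, so H(B|A) = 0 and I(A;B) = 1 bit.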

  7. Information Bottleneck (IB) Method
  • A: a random variable that ranges over the artifacts to be clustered
  • B: a random variable that ranges over the artifacts’ features
  • I(A;B): mutual information of A and B
  • Information Bottleneck Method [TPB’99]: compress A into a clustering Ck so that the information preserved about B is maximum (k = number of clusters)
  • Optimization criterion: minimize I(A;B) - I(Ck;B), or equivalently minimize H(B|Ck) - H(B|A)
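The optimization criterion can be evaluated for any candidate clustering by collapsing the artifact axis of the joint distribution and comparing mutual information before and after. A small sketch, with an illustrative joint table (not data from the paper):

```python
import math

def mi(joint):
    """Mutual information (bits) from a joint distribution {(x, y): prob}."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return sum(p * math.log2(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

def information_loss(joint, assign):
    """I(A;B) - I(C;B) for the clustering induced by the map `assign`."""
    collapsed = {}
    for (a, b), p in joint.items():
        key = (assign[a], b)
        collapsed[key] = collapsed.get(key, 0.0) + p
    return mi(joint) - mi(collapsed)
```

Merging two artifacts whose conditional feature distributions are identical loses no information about B, while collapsing everything into a single cluster loses all of I(A;B) — which is exactly why the criterion favors grouping artifacts with similar dependencies.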

  8. Information Bottleneck Method
  [diagram: artifacts a1…an (A) are compressed into clusters c1…ck (C) while preserving information about features b1…bm (B) — minimize the loss of mutual information, i.e. maximize I(C;B)]

  9. Agglomerative IB
  • Conceptualize the dependency graph as an n×m matrix (artifacts by features)
  • Compute an n×n matrix indicating the information loss we would incur if we joined any two artifacts into a cluster:

      A\B   f1    f2    f3    u1    u2
      f1    -     .10   .10   .17   .17
      f2    .10   -     .10   .17   .17
      f3    .17   .17   -     .17   .17
      u1    .17   .17   .17   -     0
      u2    .17   .17   .17   0     -

  • Merge the tuples with the minimum information loss
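The pairwise loss entries above can be computed as the weighted Jensen-Shannon divergence between the two artifacts' conditional feature distributions, scaled by their combined probability mass — a standard formulation of the agglomerative IB merge cost. A sketch, assuming each artifact is represented by its marginal probability and a `{feature: prob}` conditional:

```python
import math

def _h(dist):
    """Shannon entropy (bits) of {outcome: prob}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def merge_cost(p_i, cond_i, p_j, cond_j):
    """Information lost by merging artifacts i and j:
    (p_i + p_j) times the weighted Jensen-Shannon divergence
    between p(B|i) and p(B|j)."""
    w = p_i + p_j
    # mixture distribution of the merged cluster
    mix = {}
    for dist, weight in ((cond_i, p_i / w), (cond_j, p_j / w)):
        for b, p in dist.items():
            mix[b] = mix.get(b, 0.0) + weight * p
    jsd = _h(mix) - (p_i / w) * _h(cond_i) - (p_j / w) * _h(cond_j)
    return w * jsd
```

Artifacts with identical dependencies have zero merge cost, while artifacts with disjoint dependencies are maximally expensive to merge — which is how the algorithm ends up grouping files that "have almost the same dependencies".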

  10. Adding Non-Structural Data
  • If we have information about the developer and location of files, we express the artifacts to be clustered using a new matrix
  • Instead of B we use B’ to include the non-structural data
  • We can compute I(A;B’) and proceed as before
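One simple way to build the extended matrix for B’ is to append one-hot columns for the non-structural attribute values to each artifact's structural feature row. A sketch under that assumption (the attribute names are hypothetical):

```python
def extend_features(struct_rows, nonstruct):
    """Append one-hot non-structural columns (developer, directory, ...)
    to each artifact's structural feature row, forming the matrix for B'."""
    # stable, sorted vocabulary of all non-structural attribute values
    vocab = sorted({v for vals in nonstruct.values() for v in vals})
    return {art: row + [1 if v in nonstruct.get(art, ()) else 0 for v in vocab]
            for art, row in struct_rows.items()}
```

After this step the extended rows can be normalized into conditional distributions and fed to the same clustering machinery as the purely structural features.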

  11. ScaLable InforMation BOttleneck
  • AIB has quadratic complexity, since we need to compute an n×n distance matrix
  • LIMBO algorithm:
    • Produce summaries of the artifacts
    • Apply agglomerative clustering on the summaries
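The summarization step can be illustrated with a simplified one-pass sketch: keep a bounded number of summaries and fold each incoming artifact into whichever summary is cheapest to join. This is only the spirit of the approach — the actual LIMBO algorithm organizes its summaries more carefully — and the `cost` callback is an assumed placeholder for a merge-cost function such as the one on the previous slide:

```python
def summarize(artifacts, budget, cost):
    """One-pass summarization sketch: keep at most `budget` summaries;
    an artifact starts a new summary while space remains, otherwise it
    joins the summary it is cheapest to merge into."""
    summaries = []
    for a in artifacts:
        if len(summaries) < budget:
            summaries.append([a])
        else:
            min(summaries, key=lambda s: cost(s, a)).append(a)
    return summaries
```

Agglomerative clustering then runs on the (much smaller) set of summaries instead of the full n×n artifact matrix, which is what restores scalability.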

  12. Experimental Evaluation
  • Data Sets
    • TOBEY: 939 files / 250,000 LOC
    • LINUX: 955 files / 750,000 LOC
  • Clustering Algorithms
    • ACDC: pattern-based
    • BUNCH: adheres to high cohesion and low coupling (NAHC, SAHC)
  • Cluster Analysis Algorithms
    • Single linkage (SL)
    • Complete linkage (CL)
    • Weighted average linkage (WA)
    • Unweighted average linkage (UA)

  13. Experimental Evaluation
  • Compared the output of different algorithms using MoJo
  • MoJo measures the number of Move/Join operations needed to transform one clustering into another
  • The smaller the MoJo value of a particular clustering, the more effective the algorithm that produced it
  • We compute MoJo with respect to an authoritative decomposition
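The flavor of the metric can be shown with a greedy sketch: map each of our clusters to the authoritative group it overlaps most, count a Move for every misplaced artifact, then a Join for every pair of clusters mapped to the same group. Exact MoJo uses an optimal mapping, so this simplified version only yields an upper bound:

```python
from collections import Counter

def mojo_upper_bound(ours, authoritative):
    """Greedy upper bound on MoJo(ours -> authoritative).
    `ours` is a list of clusters (lists of artifacts);
    `authoritative` maps each artifact to its authoritative group."""
    moves, targets = 0, []
    for cluster in ours:
        tags = Counter(authoritative[a] for a in cluster)
        group, hits = tags.most_common(1)[0]   # best-overlapping group
        targets.append(group)
        moves += len(cluster) - hits           # artifacts that must Move
    joins = len(targets) - len(set(targets))   # clusters that must Join
    return moves + joins
```

A clustering identical to the authoritative decomposition scores 0; splitting one authoritative group in two costs a single Join.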

  14. Structural Feature Results
  [results chart: LIMBO found utility clusters]

  15. Non-Structural Feature Results
  • We considered all possible combinations of structural and non-structural features
  • Non-structural features were available only for Linux:
    • Developers (dev)
    • Directory (dir)
    • Lines of Code (loc)
    • Time of Last Update (time)
  • For each combination we report the number of clusters k when the MoJo value between k and k+1 differs by one

  16. Non-Structural Feature Results
  • 8 combinations outperform the structural results
  • “Dir” information produced better decompositions
  • “Dev” information has a positive effect
  • “Time” leads to worse clusterings
