
Computational Discovery in Evolving Complex Networks

This study focuses on the computational discovery of evolving complex networks, particularly in the domain of Open Source Software (OSS). The methodology includes data mining, network analysis, computer simulation, and research collaboration. The goal is to search for and predict patterns in OSS development and contribute to computational scientific discovery.


Presentation Transcript


  1. Computational Discovery in Evolving Complex Networks • Yongqin Gao • Advisor: Greg Madey

  2. Outline • Background • Methodology for Computational Discovery • Problem Domain – OSS Research • Process I: Data Mining • Process II: Network Analysis • Process III: Computer Simulation • Process IV: Research Collaboratory • Contributions • Conclusion and Future Work

  3. Background • Network research is gaining more attention • Internet • Communication network • Social network • Software developer network • Biological network • Understanding the evolving complex network • Goal I: Search • Goal II: Prediction • Computational scientific discovery

  4. Computational Discovery: Our Methodology

  5. Problem Domain • Open Source Software Movement • What is OSS • Free to use, modify, and distribute; source code is available and modifiable • Potential advantages over commercial software: potentially high quality, fast development, low cost • Why study OSS (Goal) • Software engineering: new development and coordination methods • Open content: a model for other forms of open, shared collaboration • Complexity: a successful example of self-organization/emergence

  6. Glory of OSS: Number of Active Apache Hosts

  7. Problem Domain • SourceForge.net community • One of the biggest OSS development communities • 134,751 registered projects • 1,439,773 registered users

  8. Problem Domain • Our Data Set • 25 monthly dumps since January 2003. • 460 GB in total, growing at 25 GB/month. • Every dump has about 100 tables. • The largest table has up to 30 million records. • Experiment Environment • Dual Xeon 3.06 GHz, 4 GB memory, 2 TB storage • Linux 2.4.21-40.ELsmp with PostgreSQL 8.1

  9. Related Research • OSS research • W. Scacchi, “Free/open source software development practices in the computer game community”, IEEE Software, 2004. • K. Crowston, H. Annabi and J. Howison, “Defining open source software project success”, 24th International Conference on Information Systems, Seattle, 2003. • Complex networks • L.A. Adamic and B.A. Huberman, “Scaling behavior of the world wide web”, Science, 2000. • M.E.J. Newman, “Clustering and preferential attachment in growing networks”, Physical Review E, 2001.

  10. Process I: Data Mining • Related Research: • S. Chawla, B. Arunasalam and J. Davis, “Mining open source software (OSS) data using association rules network”, PAKDD, 2003. • D. Kempe, J. Kleinberg and E. Tardos, “Maximizing the spread of influence through a social network”, SIGKDD, 2003. • C. Jensen and W. Scacchi, “Data mining for software process discovery in open source software development communities”, Workshop on Mining Software Repositories, 2004.

  11. Process I: Data Mining • [Figure: data mining pipeline. Stages shown: Raw data, Data Preparation, Database, Data Purging, Relevant data, Feature Selection, Algorithm Application, Knowledge]

  12. Process I: Data Mining • Data Preparation • Data discovery • Locating the information • Data characterization • Activity features: user categorization • Network features • Data assembly • Data Purging • Handling data inconsistency • Unifying the date presentation by loading into a single repository • Handling data pollution • Removing “inactive” projects • Feature Selection • Used to remove dependent or insignificant features. • NMF (Non-negative Matrix Factorization)

  13. Process I: Data Mining • Result I • Significant features • By feature selection, we can identify the significant feature set describing the projects. • Activity features: “file_releases”, “followup_msg”, “support_assigned”, “feature_assigned” and task related features • Network features: “degrees”, “betweenness” and “closeness”

  14. Process I: Data Mining • Distribution-based clustering (Christley, 2005) • Clustering according to the distribution of features instead of the values of individual features • We assume every entity (project) has an underlying distribution of the feature set (activity features) • Using a statistical hypothesis test • Non-parametric test • Fisher’s contingency-table test is used • Joachim Krauth, “Distribution-free statistics: an application-oriented approach”, Elsevier Science Publisher, 1988.

  15. Process I: Data Mining • Procedure:
      While (still unclustered entities)
          Put all unclustered entities into one cluster
          While (some entities not yet pairwise compared)
              A = Pick entity from cluster
              For each other entity, B, in cluster not yet compared to A
                  Run statistical test on A and B
                  If significant result
                      Remove B from cluster
      • Worst case complexity: O(n^2)
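
The procedure above can be sketched in Python roughly as follows. This is a minimal illustration and not the authors' implementation: the function names are hypothetical, each entity is assumed to be a vector of activity counts, and a Pearson chi-square statistic compared against a fixed critical value stands in for Fisher's contingency-table test named on the previous slide.

```python
from typing import Callable, Dict, List, Sequence

Counts = Sequence[int]


def chi_square_statistic(a: Counts, b: Counts) -> float:
    """Pearson chi-square statistic for the 2 x k table formed by a and b.

    Stand-in (assumption) for the Fisher contingency-table test in the slides.
    """
    total_a, total_b = sum(a), sum(b)
    grand = total_a + total_b
    stat = 0.0
    for x, y in zip(a, b):
        column = x + y
        for observed, row_total in ((x, total_a), (y, total_b)):
            expected = row_total * column / grand
            if expected > 0:  # skip empty cells to avoid dividing by zero
                stat += (observed - expected) ** 2 / expected
    return stat


def distribution_clusters(
    entities: Dict[str, Counts],
    differs: Callable[[Counts, Counts], bool],
) -> List[List[str]]:
    """Group entities whose distributions are pairwise not significantly different."""
    unclustered = list(entities)
    clusters: List[List[str]] = []
    while unclustered:
        cluster = unclustered  # put all unclustered entities into one cluster
        rejected: List[str] = []
        i = 0
        while i < len(cluster):
            a = cluster[i]
            kept = cluster[: i + 1]
            for b in cluster[i + 1:]:  # members not yet compared to a
                if differs(entities[a], entities[b]):
                    rejected.append(b)  # significant result: remove b
                else:
                    kept.append(b)
            cluster = kept
            i += 1
        clusters.append(cluster)
        unclustered = rejected  # removed entities go to the next round
    return clusters


# Hypothetical activity counts over four feature categories.
projects = {
    "p1": [10, 10, 10, 10],
    "p2": [12, 9, 11, 8],
    "p3": [40, 1, 1, 1],
    "p4": [35, 2, 1, 2],
}
# 7.815 is the 5% critical value of chi-square with 3 degrees of freedom.
result = distribution_clusters(
    projects, lambda a, b: chi_square_statistic(a, b) > 7.815
)
```

Passing the test in as a function keeps the clustering loop independent of the statistic, so an exact test could be swapped in without touching the loop.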

  16. Process I: Data Mining • Result II • Unsupervised learning • Distribution-based method used to cluster the project history using the activity distribution • We named the clusters using ID and the results are shown in the table • High support and confidence in evaluation

  17. Process I: Data Mining • Two sample distributions from different categories • Unbalanced feature distribution → could be “unpopular” • Balanced feature distribution → could be “popular”

  18. Process I: Data Mining • Discoveries in Process I • Significant feature set selection • Network features are important • Further inspection in next process • Distribution based predictor • Based on the activity feature distribution • Prediction of the “popularity” based on the balance of the activity feature distribution • Benefit of these discoveries • For collaboration based communities, these discoveries can help in resource allocation optimization.

  19. Process II: Network Analysis • Why network analysis • Assess the importance of the network measures to the whole network and to individual entities in the network • Inspect the developing patterns of these network measures • Network analysis • Structure analysis • Centrality analysis • Path analysis

  20. Process II: Network Analysis • Related research: • P. Erdős and A. Rényi, “On random graphs”, Publicationes Mathematicae, 1959. • D.J. Watts and S.H. Strogatz, “Collective dynamics of small-world networks”, Nature, 1998. • A.L. Barabási and R. Albert, “Emergence of scaling in random networks”, Science, 1999. • Y. Gao, “Topology and evolution of the open source software community”, Master's Thesis, 2003.

  21. Process II: Network Analysis • Structure Analysis • Understanding the influence of the network structure on individual entities in the network • Inspected measures • Approximate diameter • Approximate clustering coefficient • Component distribution
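
As an illustration of how the first two measures might be approximated on a large graph, the sketch below samples nodes instead of visiting all of them. The slides do not say which approximation was used, so the sampling scheme, the function names, and the adjacency-set representation are all assumptions. The sampled diameter is a lower bound on the true diameter of a connected graph.

```python
import random
from collections import deque
from typing import Dict, Hashable, Set

Graph = Dict[Hashable, Set[Hashable]]  # undirected graph as adjacency sets


def eccentricity(graph: Graph, source: Hashable) -> int:
    """Largest BFS distance from source to any reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return max(dist.values())


def approximate_diameter(graph: Graph, samples: int, rng: random.Random) -> int:
    """Maximum eccentricity over a random sample of source nodes."""
    sources = rng.sample(list(graph), min(samples, len(graph)))
    return max(eccentricity(graph, s) for s in sources)


def local_clustering(graph: Graph, node: Hashable) -> float:
    """Fraction of pairs of neighbors of node that are themselves linked."""
    neighbors = list(graph[node])
    k = len(neighbors)
    if k < 2:
        return 0.0
    links = sum(
        1
        for i in range(k)
        for j in range(i + 1, k)
        if neighbors[j] in graph[neighbors[i]]
    )
    return 2.0 * links / (k * (k - 1))


def approximate_clustering(graph: Graph, samples: int, rng: random.Random) -> float:
    """Mean local clustering coefficient over a random sample of nodes."""
    nodes = rng.sample(list(graph), min(samples, len(graph)))
    return sum(local_clustering(graph, n) for n in nodes) / len(nodes)


# Toy graph: a triangle a-b-c with a tail c-d.
toy = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
rng = random.Random(0)
diameter = approximate_diameter(toy, samples=4, rng=rng)
clustering = approximate_clustering(toy, samples=4, rng=rng)
```

With the sample size equal to the number of nodes, as in the toy call, both functions return the exact values; smaller samples trade accuracy for running time.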

  22. Process II: Network Analysis • Conversion among C-NET, P-NET and D-NET
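
The slide does not define the three networks, so the following sketch rests on an assumption: C-NET is the bipartite developer-project network, D-NET links two developers who share a project, and P-NET links two projects that share a developer. Under that reading, the conversion from C-NET to the other two is a pair of bipartite projections. The function name and input format are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Iterable, Set, Tuple

Edge = Tuple[str, str]


def project_networks(memberships: Iterable[Edge]) -> Tuple[Set[Edge], Set[Edge]]:
    """Derive D-NET and P-NET edge sets from C-NET (developer, project) pairs."""
    projects_of: Dict[str, Set[str]] = defaultdict(set)
    developers_of: Dict[str, Set[str]] = defaultdict(set)
    for developer, project in memberships:
        projects_of[developer].add(project)
        developers_of[project].add(developer)

    # D-NET: two developers are linked when they share at least one project.
    d_net = {
        pair
        for members in developers_of.values()
        for pair in combinations(sorted(members), 2)
    }
    # P-NET: two projects are linked when they share at least one developer.
    p_net = {
        pair
        for joined in projects_of.values()
        for pair in combinations(sorted(joined), 2)
    }
    return d_net, p_net


# Hypothetical C-NET with three developers and two projects.
c_net = [("alice", "p1"), ("bob", "p1"), ("bob", "p2"), ("carol", "p2")]
d_net, p_net = project_networks(c_net)
```

Sorting each group before pairing stores every undirected edge once in a canonical order, so the sets contain no mirrored duplicates.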

  23. Process II: Network Analysis • Result I • Approximate Diameters • D-NET: between (5,7) while network size ranged from 151,803 to 195,744. • P-NET: between (6,8) while network size ranged from 123,192 to 161,798. • Approximate Clustering Coefficient • D-NET: between (0.85, 0.95) • P-NET: between (0.65, 0.75)

  24. Process II: Network Analysis • Result I

  25. Process II: Network Analysis • Centrality Analysis • Understanding the importance of individual entities to the global network structure • Inspected measures: • Average Degrees • Degree Distributions • Betweenness • Closeness
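
Two of the inspected measures can be sketched briefly. The definitions below are assumptions, since the slides do not state them: average degree is the mean number of neighbors, and closeness is taken as the reciprocal of the sum of shortest-path distances to all reachable nodes. The function names are hypothetical.

```python
from collections import deque
from typing import Dict, Hashable, Set

Graph = Dict[Hashable, Set[Hashable]]  # undirected graph as adjacency sets


def distances_from(graph: Graph, source: Hashable) -> Dict[Hashable, int]:
    """BFS shortest-path distances from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def average_degree(graph: Graph) -> float:
    """Mean number of neighbors per node."""
    return sum(len(neighbors) for neighbors in graph.values()) / len(graph)


def closeness(graph: Graph, node: Hashable) -> float:
    """Reciprocal of the total distance from node to all reachable nodes."""
    total = sum(distances_from(graph, node).values())
    return 1.0 / total if total else 0.0


# Toy path graph a-b-c.
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
avg = average_degree(path)
center = closeness(path, "b")
end = closeness(path, "a")
```

Under this unnormalized definition the total distance grows with the number of nodes, so closeness values shrink as a network gets larger.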

  26. Process II: Network Analysis • Result II • Average Degrees • Developer degree in C-NET: 1.4525 • Project degree in C-NET: 1.7572 • Developer degree in D-NET: 12.3100 • Project degree in P-NET: 3.8059

  27. Process II: Network Analysis • Result II (Degree distributions in C-NET)

  28. Process II: Network Analysis • Result II (Degree distributions in D-NET and P-NET)

  29. Process II: Network Analysis • Result II • Average Betweenness • P-NET: 0.2669e-003 • Average Closeness • P-NET: 0.4143e-005 • Normally these two measures yield very small values in large networks (N>10,000).

  30. Process II: Network Analysis • Path Analysis • Understanding the developing patterns of the network structure and individual entities in the network • Inspected measures: • Active Developer Percentage • Average Degrees • Diameters • Clustering coefficients • Betweenness • Closeness

  31. Process II: Network Analysis • Result III (Active entities)

  32. Process II: Network Analysis • Result III (Average degrees in C-NET)

  33. Process II: Network Analysis • Result III (Average degrees in D-NET and P-NET)

  34. Process II: Network Analysis • Result III (Diameters in D-NET and P-NET)

  35. Process II: Network Analysis • Result III (Clustering coefficients for D-NET and P-NET)

  36. Process II: Network Analysis • Result III (Average betweenness and closeness for P-NET)

  37. Process II: Network Analysis

  38. Process II: Network Analysis • Discoveries in Process II: • Measures of structure analysis and centrality analysis all indicate very high connectivity of the network. • Measures of path analysis reveal the developing patterns of these measures (life cycle behavior). • Benefits of these discoveries • High connectivity in a network is an important feature for information propagation and failure tolerance. Understanding this discovery can help us improve our practices in collaboration networks and communication networks. • Understanding the developing patterns of these network measures gives us a way to monitor network development and to improve the network if necessary.

  39. Process III: Computer Simulation • Related Research: • P.J. Kiviat, “Simulation, technology, and the decision process”, ACM Transactions on Modeling and Computer Simulation, 1991. • A.L. Barabási and R. Albert, “Emergence of scaling in random networks”, Science, 1999. • R. Axtell, R. Axelrod, J. Epstein and M. Cohen, “Aligning simulation models: A case study and results”, Computational and Mathematical Organization Theory, 1996. • Y. Gao, “Topology and evolution of the open source software community”, Master's Thesis, 2003.

  40. Process III: Computer Simulation • Iterative simulation method • Empirical dataset • Model • Simulation • Verification and validation • More measures • More methods

  41. Process III: Computer Simulation • Previous iterated models (master's thesis): • Adapted ER Model • BA Model • BA Model with fitness • BA Model with dynamic fitness • Iterated models in this study • Improved Model Four (Model I) • Constant user energy (Model II) • Dynamic user energy (Model III)

  42. Process III: Computer Simulation • Model I • Realistic stochastic procedures. • New developer every time step based on Poisson distribution • Initial fitness based on log-normal distribution • Updated procedure for the weighted project pool (for preferential selection of projects).
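
The three ingredients named on this slide can be combined in a toy simulation like the one below. Only the Poisson arrivals, the log-normal initial fitness, and the weighted project pool come from the slide. Everything else is an assumption made for illustration: the weight formula (fitness times current size plus one), one new project per time step, the distribution parameters, and all function names.

```python
import math
import random
from typing import Dict, List


def poisson(rng: random.Random, mean: float) -> int:
    """Knuth's method: multiply uniform draws until the product falls to e^-mean."""
    limit = math.exp(-mean)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def simulate_model(steps: int, arrival_mean: float, rng: random.Random) -> Dict[int, List[int]]:
    """Return a mapping from project id to the developer ids that joined it."""
    fitness: Dict[int, float] = {0: rng.lognormvariate(0.0, 1.0)}
    members: Dict[int, List[int]] = {0: []}
    next_developer = 0
    for _ in range(steps):
        # Number of new developers this step follows a Poisson distribution.
        for _ in range(poisson(rng, arrival_mean)):
            pool = list(fitness)
            # Assumed weighting: fitter and larger projects attract more joiners.
            weights = [fitness[p] * (len(members[p]) + 1) for p in pool]
            chosen = rng.choices(pool, weights=weights, k=1)[0]
            members[chosen].append(next_developer)
            next_developer += 1
        # Assumption: one new project per step, with log-normal initial fitness.
        new_project = len(fitness)
        fitness[new_project] = rng.lognormvariate(0.0, 1.0)
        members[new_project] = []
    return members


members = simulate_model(steps=50, arrival_mean=3.0, rng=random.Random(42))
sizes = sorted((len(devs) for devs in members.values()), reverse=True)
```

Passing a seeded `random.Random` instance makes a run reproducible, which helps when comparing simulated measures against the empirical ones across model iterations.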

  43. Process III: Computer Simulation • Average degrees

  44. Process III: Computer Simulation • Diameter and CC

  45. Process III: Computer Simulation • Betweenness and Closeness

  46. Process III: Computer Simulation • Degree Distributions

  47. Process III: Computer Simulation • Deficit in the measures

  48. Process III: Computer Simulation • Model II • New addition: user energy. • User energy • The “fitness” parameter for the user • Every time a new user is created, an energy level is randomly generated for the user • The energy level is used to decide whether a user will take an action during each time step.
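
One simple way to read the energy mechanism is as a per-user action probability. The sketch below assumes the energy level is drawn uniformly from [0, 1) when the user is created and that a user acts in a time step when a fresh uniform draw falls below that level. The slides give neither the distribution nor the decision rule, so both are assumptions, as are the function names.

```python
import random
from typing import List, Tuple


def acts(energy: float, rng: random.Random) -> bool:
    """Decide whether a user with the given energy acts in this time step."""
    return rng.random() < energy


def simulate_activity(
    n_users: int, steps: int, rng: random.Random
) -> Tuple[List[float], List[int]]:
    """Assign constant energy levels, then count actions per user over all steps."""
    energy = [rng.random() for _ in range(n_users)]  # fixed at user creation
    actions = [0] * n_users
    for _ in range(steps):
        for user, level in enumerate(energy):
            if acts(level, rng):
                actions[user] += 1
    return energy, actions


energy, actions = simulate_activity(n_users=5, steps=100, rng=random.Random(7))
```

Because the energy level stays fixed after creation, this corresponds to the constant user energy variant; a dynamic variant would update `energy` inside the step loop.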

  49. Process III: Computer Simulation • Degree distributions for Model II

  50. Process III: Computer Simulation • Deficit in the measures
