Topology and Evolution of the Open Source Software Community

Yongqin Gao

Advisors: Dr. Vincent W. Freeh, Dr. Kevin Bowyer

Supported in part by the National Science Foundation – Digital Science & Technology

Outline
  • Overview
  • Data collection
  • Network modeling
  • Topological statistical analysis (real data)
  • Simulations
  • Publications
  • Conclusions
Overview (about OSS)
  • What is OSS
    • Free to use, free to distribute
    • Unlimited users and usage
    • Source code available and modifiable
  • Potential advantages over commercial software
    • Higher quality
    • Faster development
    • Lower cost
    • Transparent
Overview (about our research)
  • Our goal
    • Understanding the OSS phenomenon
  • Approach
    • SourceForge is the source of our empirical data
    • Modeling as a social network
    • Analysis of topological statistics
    • Use simulation to verify and validate the model
Data Collection — Monthly
  • Web crawler (scripts)
    • Python
    • Shell
    • AWK
    • Sed
  • Monthly
  • Since Jan 2001
  • ProjectID
  • DeveloperID
  • Almost 2 million records
  • Relational database

PROJ|DEVELOPER
8001|dev348
8001|dev8972
8001|dev9922
8002|dev27650
8005|dev31351
8006|dev12409
8007|dev19935
8007|dev4262
8007|dev36711
8008|dev8972
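
The PROJ|DEVELOPER records above can be loaded into a two-way membership map (project to developers, developer to projects). The following is a minimal sketch, not the crawler or database code used in the study; the function name `load_membership` and the in-memory dictionaries are assumptions made for illustration.

```python
from collections import defaultdict

# Sample rows copied from the slide; the real table held almost 2 million records.
RECORDS = """\
PROJ|DEVELOPER
8001|dev348
8001|dev8972
8001|dev9922
8002|dev27650
8005|dev31351
8006|dev12409
8007|dev19935
8007|dev4262
8007|dev36711
8008|dev8972
"""

def load_membership(text):
    """Return (project -> developers, developer -> projects) maps.

    Hypothetical helper: the slides do not describe the actual loader.
    """
    devs_of = defaultdict(set)
    projs_of = defaultdict(set)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("PROJ"):
            continue  # skip blank lines and the header row
        project, developer = line.split("|")
        devs_of[project].add(developer)
        projs_of[developer].add(project)
    return devs_of, projs_of

devs_of, projs_of = load_membership(RECORDS)
print(len(devs_of), "projects,", len(projs_of), "developers")
print(sorted(projs_of["dev8972"]))  # projects that dev8972 belongs to
```

In this sample, dev8972 appears under two project IDs, which is the kind of shared membership that later links projects together.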

Modeling as Collaboration Network
  • What is a collaboration network?
    • A social network representing the collaborating relationships.
    • Movie actor network and scientist collaboration network
  • How the SourceForge collaboration network differs
    • Link detachment
    • Virtual collaboration
    • Voluntary
    • Global
  • Bipartite property of collaboration networks
Collaboration network - bipartite

Adapted from Newman, Strogatz and Watts, 2001

SourceForge Developer Network

[Figure: OSS Developer Network (Part). Developers are nodes / Projects are links.]
  • 24 Developers
  • 5 Projects (7597, 6882, 7028, 9859, 15850)
  • 2 hub Developers
  • 1 Cluster
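
The rule "Developers are nodes / Projects are links" amounts to projecting the bipartite membership data onto developers. The sketch below shows one way to do that; the function name `developer_network` and the toy membership data are hypothetical and are not taken from the figure.

```python
from itertools import combinations

def developer_network(devs_of):
    """Project a project -> developers map onto developers.

    Two developers become neighbors when they share at least one project.
    """
    adj = {}
    for developers in devs_of.values():
        for dev in developers:
            adj.setdefault(dev, set())  # keep solo developers as isolated nodes
        for a, b in combinations(sorted(developers), 2):
            adj[a].add(b)
            adj[b].add(a)
    return adj

# Hypothetical toy membership (not the data behind the slide).
toy = {"p1": {"a", "b", "c"}, "p2": {"c", "d"}, "p3": {"e"}}
net = developer_network(toy)
for dev in sorted(net):
    print(dev, sorted(net[dev]))
```

Swapping the roles of the two node types (projects as nodes, shared developers as links) would give the dual project network in the same way.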

Topological Analysis
  • Statistics inspected
    • Diameter
    • Average degree
    • Clustering coefficient
    • Degree distribution
    • Cluster size distribution
    • Relative size of major cluster
    • Fitness and life cycle
  • Evolution of these statistics
  • Dual networks
    • developer network and project network
Terminology
  • Diameter
    • Average length of shortest paths between all pairs of vertices
  • Degree
    • The count of edges connected to given vertex
  • Average degree
    • Average of the degrees of all vertices in the network
  • Cluster
    • A connected component of the network
  • Clustering coefficient (CC)
    • CCi: for vertex i, the number of links actually present among its neighbors divided by the total possible number of links among those neighbors.
    • CC: average of all CCi in a network
  • Degree distribution
    • The distribution of degrees throughout a network
  • Major cluster
    • The largest cluster in the network
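
The degree and clustering-coefficient definitions above translate directly into code over an adjacency map. This is a sketch under stated assumptions: the graph is undirected, stored as a dict of neighbor sets, and a vertex with fewer than two neighbors is given CCi = 0 (a convention chosen here, not stated on the slide). All function names are illustrative.

```python
def degree(adj, v):
    """Number of edges connected to vertex v."""
    return len(adj[v])

def average_degree(adj):
    """Mean degree over all vertices."""
    return sum(len(neighbors) for neighbors in adj.values()) / len(adj)

def local_cc(adj, v):
    """CCi: links present among v's neighbors / links possible among them."""
    neighbors = adj[v]
    k = len(neighbors)
    if k < 2:
        return 0.0  # assumed convention for vertices with < 2 neighbors
    # Each neighbor-neighbor link is seen from both ends, so halve the count.
    links = sum(1 for u in neighbors for w in adj[u] if w in neighbors) // 2
    return links / (k * (k - 1) / 2)

def clustering_coefficient(adj):
    """CC: average of CCi over all vertices."""
    return sum(local_cc(adj, v) for v in adj) / len(adj)

# Hypothetical toy graph: a triangle a-b-c with a pendant vertex d on c.
g = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
print(average_degree(g), clustering_coefficient(g))
```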
Diameter of Developer Network vs. Time
  • Network size increased from 30,000 to 70,000
Diameter of Project Network vs. Time
  • Network size increased from 20,000 to 50,000.
  • Diameter decreases over time for both the developer network and the project network
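
Using this deck's definition of diameter (the average length of shortest paths between pairs of vertices), the statistic can be computed with one breadth-first search per vertex. The sketch below is exact and therefore only practical for small graphs; it is not the approximate method the study used, and it assumes that unreachable pairs are simply left out of the average.

```python
from collections import deque

def bfs_distances(adj, source):
    """Shortest-path length (in hops) from source to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def average_path_length(adj):
    """Mean shortest-path length over all reachable ordered pairs."""
    total = 0
    pairs = 0
    for s in adj:
        for t, d in bfs_distances(adj, s).items():
            if t != s:
                total += d
                pairs += 1
    return total / pairs if pairs else 0.0

# Hypothetical toy graph: a simple path a - b - c.
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
print(average_path_length(path))
```

For networks with tens of thousands of vertices, one common shortcut is to run the search from a random sample of sources only; whether that matches the study's approximation is not stated in the slides.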
Cluster Size Distribution
  • R² with major cluster is 0.7426
  • R² without major cluster is 0.9799
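
An R² value for a power-law fit is typically obtained by fitting a straight line to the distribution on log-log axes. The sketch below assumes that approach (the slides do not say how the fit was done) and uses synthetic data, not the SourceForge measurements, to show why a single very large cluster can pull R² down.

```python
import math

def loglog_fit(points):
    """Least-squares line through (log10 x, log10 y).

    Returns (slope, intercept, r_squared). Assumes all x and y are positive.
    """
    xs = [math.log10(x) for x, _ in points]
    ys = [math.log10(y) for _, y in points]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy) if syy else 1.0
    return slope, intercept, r_squared

# Synthetic (cluster size, number of clusters) pairs following count = 1024 * size^-2.
tail = [(1, 1024), (2, 256), (4, 64), (8, 16), (16, 4)]
# One hypothetical "major cluster" far from the rest of the distribution.
with_major = tail + [(5000, 1)]

slope, _, r2_tail = loglog_fit(tail)
_, _, r2_all = loglog_fit(with_major)
print(slope, r2_tail, r2_all)
```

The fit over `tail` alone is essentially perfect, while including the single outlier lowers R², which mirrors the direction of the difference reported on the slide.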
Relative Size of Major Cluster vs. Time
  • The relative size of the major cluster increases over time
  • The rate of increase slows over time
  • May be an indication of how the network is evolving
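
The relative size of the major cluster is the size of the largest connected component divided by the total number of vertices. A minimal sketch, assuming an undirected graph stored as a dict of neighbor sets; the function names and the toy graph are illustrative only.

```python
def clusters(adj):
    """Return the connected components of the graph as a list of vertex sets."""
    seen = set()
    result = []
    for start in adj:
        if start in seen:
            continue
        component = {start}
        stack = [start]
        while stack:
            u = stack.pop()
            for w in adj[u]:
                if w not in component:
                    component.add(w)
                    stack.append(w)
        seen |= component
        result.append(component)
    return result

def relative_major_cluster_size(adj):
    """Fraction of all vertices that belong to the largest cluster."""
    sizes = [len(c) for c in clusters(adj)]
    return max(sizes) / sum(sizes)

# Hypothetical toy graph: clusters of size 3, 2 and 1.
toy = {
    "a": {"b"}, "b": {"a", "c"}, "c": {"b"},
    "d": {"e"}, "e": {"d"},
    "f": set(),
}
print(sorted(len(c) for c in clusters(toy)), relative_major_cluster_size(toy))
```

Computing this value on each monthly snapshot would give the time series that the slide describes.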
Existence of Fitness
  • Tracking the development of individual projects can verify the existence of the “newcomer” phenomenon
  • We tracked every project created in July 2001 from then until now (1,660 projects in total)
  • The maximum monthly growth per project is 13, while the average monthly growth per project is just 0.3639
Summary of Results
  • Power law rules
    • Degree distributions, cluster distribution
  • Average degree increasing with time
  • Diameter decreasing with time
  • Clustering coefficient decreasing with time
  • Fitness exists in SourceForge
  • Projects exhibit life-cycle behavior
Agent-based Modeling
  • EBM (equation-based modeling) vs. ABM (agent-based modeling)
    • Heterogeneous individuals
    • Complex network
  • Experiment environment
    • Hardware: computer cluster
    • Software:
      • Simulation toolkits: Swarm
      • Database: Oracle
      • Language: Java, PL/SQL
Model for SourceForge
  • ABM based on bipartite graph
  • Model description
    • Agent: developer
    • Behaviors: Create, join, abandon and idle
    • Preference: developer’s and project’s
    • Fitness
  • Four models in iterations
    • ER, BA, BA with constant fitness and BA with dynamic fitness
  • Comparison of empirical and simulated data
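
The model description above (developer agents that create, join, abandon or idle, on top of a bipartite graph) can be illustrated with a very small simulation loop. This is a sketch only: the study used Swarm with Java, and every probability, the one-arrival-per-step schedule, and the rule that an abandon step removes a random existing membership are assumptions made here. The `preferential` switch contrasts uniform choice (ER-like) with choice weighted by project size (BA-like).

```python
import random

def simulate(steps, seed=0, preferential=True,
             p_create=0.2, p_join=0.6, p_abandon=0.1):
    """Return project id -> set of developer ids after `steps` arrivals.

    All parameters are hypothetical; they are not values from the study.
    """
    rng = random.Random(seed)
    members = {}
    next_project = 0
    for dev in range(steps):  # one new developer agent arrives per step
        r = rng.random()
        if r < p_create or not members:
            # Create: the developer starts a new project.
            members[next_project] = {dev}
            next_project += 1
        elif r < p_create + p_join:
            # Join: pick a project uniformly or in proportion to its size.
            projects = sorted(members)
            weights = [len(members[p]) if preferential else 1 for p in projects]
            target = rng.choices(projects, weights=weights, k=1)[0]
            members[target].add(dev)
        elif r < p_create + p_join + p_abandon:
            # Abandon: a random existing membership is dropped (link detachment).
            project = rng.choice(sorted(members))
            leaver = rng.choice(sorted(members[project]))
            members[project].remove(leaver)
            if not members[project]:
                del members[project]
        # Otherwise the developer idles and stays unattached.
    return members

ba_like = simulate(500, seed=1, preferential=True)
er_like = simulate(500, seed=1, preferential=False)
print(max((len(m) for m in ba_like.values()), default=0),
      max((len(m) for m in er_like.values()), default=0))
```

The resulting membership map has the same shape as the project-developer records, so the same topological statistics can be computed on simulated and empirical data for comparison.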
ER Model - Diameter
  • Average degree decreases, whereas it increases in the empirical data
  • Diameter increases, whereas it decreases in the empirical data
ER Model – Clustering Coefficient
  • Clustering coefficient is relatively low (under 0.3), whereas it is around 0.7 in the empirical data.
ER Model – Degree Distribution
  • The degree distribution is a normal distribution, whereas it follows a power law in the empirical data
ER Model – Cluster Size Distribution
  • The power-law fit has an R² of 0.6667 (0.9653 without the major cluster), whereas R² in the empirical data is 0.7426 (0.9799 without the major cluster)
  • The actual distribution differs from the empirical data
BA Model – Diameter and Clustering Coefficient
  • Small diameter and high clustering coefficient, as in the empirical data
  • Diameter and clustering coefficient both decrease over time, as in the empirical data
BA Model – Degree Distribution
  • Power laws in degree distributions, similar to empirical data (o for simulated data and x for empirical data).
  • For developer distribution: simulated data has R² of 0.9798 and empirical data has R² of 0.9714.
  • For project distribution: simulated data has R² of 0.6650 and empirical data has R² of 0.9838.
BA Model with Constant Fitness
  • Power laws in degree distributions, similar to empirical data (o for simulated data and x for empirical data).
  • For developer distribution: simulated data has R² of 0.9742 and empirical data has R² of 0.9714.
  • For project distribution: simulated data has R² of 0.7253 and empirical data has R² of 0.9838.
BA Model with Dynamic Fitness
  • Power laws in degree distribution, similar to empirical data (o for simulated data and x for empirical data).
  • For developer distribution: simulated data has R² of 0.9695 and empirical data has R² of 0.9714.
  • For project distribution: simulated data has R² of 0.8051 and empirical data has R² of 0.9838.
Advantage of Dynamic Fitness
  • Intuition: Fitness should decrease with time.
  • Statistics: projects exhibit life-cycle behavior, which cannot be replicated by the BA model with constant fitness but can be replicated by the BA model with dynamic fitness
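
The difference between constant and dynamic fitness can be shown by how each one weights a project when a developer chooses what to join. The sketch below assumes the attachment weight is project size multiplied by fitness, and that dynamic fitness decays exponentially with project age; the slides state only that fitness should decrease with time, so the functional form, the `half_life` parameter and all names are hypothetical.

```python
def constant_fitness(age, base=1.0):
    """Fitness that ignores project age."""
    return base

def dynamic_fitness(age, base=1.0, half_life=6.0):
    """Hypothetical decay: fitness halves every `half_life` months."""
    return base * 0.5 ** (age / half_life)

def attachment_weights(sizes, ages, fitness):
    """Weight of each project = current size * fitness at its age."""
    return [size * fitness(age) for size, age in zip(sizes, ages)]

# Hypothetical example: an old, large project and a brand-new small one.
sizes = [10, 2]   # developers per project
ages = [24, 0]    # project age in months

print(attachment_weights(sizes, ages, constant_fitness))
print(attachment_weights(sizes, ages, dynamic_fitness))
```

With constant fitness the older, larger project always dominates, whereas a decaying fitness lets a newcomer project out-compete it for a while, which is one way a rise-and-fade life cycle can emerge.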
Summary of Results
  • We use ABM to model and simulate the SourceForge collaboration network.
  • Conceptual framework is proposed for agent-based modeling and simulation.
  • Case study of this framework: SourceForge study through ER, BA, BA with constant fitness and BA with dynamic fitness.
Publications To-date
  • Yongqin Gao, "Modeling and Simulation of the OSS Community", Seventh Annual Swarm Researchers Meeting (Swarm2003), Notre Dame, IN, 2003.
  • Yongqin Gao, Vince Freeh, and Greg Madey, "Analysis and Modeling of the Open Source Software Community", NAACSOS Conference 2003, Pittsburgh.
  • Yongqin Gao, Vince Freeh, and Greg Madey, "Conceptual Framework for Agent-based Modeling and Simulation", NAACSOS Conference 2003, Pittsburgh.
  • Greg Madey, Vincent Freeh, Renee Tynan, Yongqin Gao, Chris Hoffman, "Agent-based Modeling and Simulation of Collaborative Social Networks", AMCIS 2003, Tampa, FL.
Possible Journals
  • Chapter 3
    • Physica A: statistical mechanics and its applications
    • Journal of Social Structure (JSS)
  • Chapter 4
    • Journal of Artificial Societies and Social Simulation (JASSS)
    • Journal of Statistical Computation and Simulation (JSCS)
Conclusion
  • Studying the SourceForge collaboration network can help us understand the OSS community
  • We investigate not only the topological statistics but also the evolution of these statistics.
  • Simulation is used to investigate the SourceForge collaboration network.
Contribution
  • Statistical study of the SourceForge community (snapshot and evolution)
  • Verification of the approximate method to calculate the diameter and CC
  • Proposal of a model for the SourceForge community
  • Improvement of the BA model through dynamic fitness
Future Work
  • Data collection
    • Database dump from SourceForge (PostgreSQL 8GB)
    • All the possible attributes
    • Database schema in UML
  • More topology analysis (with more attributes)
    • Discussion forum
    • Task assignment
    • Project management
    • Active testing
  • Behavior-based analysis
    • Interaction between agents
    • H. Peyton Young’s model
  • Information entropy analysis
Acknowledgements
  • Committee
  • Advisors
  • Colleagues
  • SourceForge
  • NSF
  • Others