slide1 l.
Download
Skip this Video
Loading SlideShow in 5 Seconds..
An Investigation into the Free/Open Source Software Phenomenon using Data Mining, Social Network Theory, and Agent-Based PowerPoint Presentation
Download Presentation
An Investigation into the Free/Open Source Software Phenomenon using Data Mining, Social Network Theory, and Agent-Based

Loading in 2 Seconds...

play fullscreen
1 / 43

An Investigation into the Free/Open Source Software Phenomenon using Data Mining, Social Network Theory, and Agent-Based - PowerPoint PPT Presentation


  • 219 Views
  • Uploaded on

An Investigation into the Free/Open Source Software Phenomenon using Data Mining, Social Network Theory, and Agent-Based Greg Madey Computer Science & Engineering University of Notre Dame UIUC - NSF Workshop on Continuous (Re)Design of Open Source Software

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about 'An Investigation into the Free/Open Source Software Phenomenon using Data Mining, Social Network Theory, and Agent-Based' - Patman


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
slide1

An Investigation into the Free/Open Source Software Phenomenon using Data Mining, Social Network Theory, and Agent-Based

Greg MadeyComputer Science & EngineeringUniversity of Notre Dame

UIUC - NSF Workshop on Continuous (Re)Design of

Open Source Software

University of Illinois, Urbana-Champaign

October 8-9, 2003

This research was partially supported by the US National Science Foundation, CISE/IIS-Digital Society & Technology, under Grant No. 0222829

contributors
Contributors
  • Vincent Freeh, Computer Science, North Carolina State University (Principal Investigator)
  • Yongqin Gao, Computer Science and Engineering, University of Notre Dame (Graduate Student)
  • Jeff Goett, University of Notre Dame (REU Student)
  • Chris Hoffman, University of Notre Dame (REU Student)
  • Nadir Kiyanclar, University of Notre Dame (REU Student)
  • Greg Madey, Computer Science & Engineering, University of Notre Dame (Principal Investigator)
  • Patrick McGovern, Director SourceForge.net, VA Software (Industrial Collaborator)
  • Carlos Siu, University of Notre Dame (REU Student)
  • Renee Tynan, Department of Management, College of Business, University of Notre Dame (Principal Investigator)
  • Jin Xu, Computer Science & Engineering, University of Notre Dame (Graduate Student)
outline
Outline
  • Research approach
  • Tools and definitions: Agents, models, simulations, collaborative social networks, computer experiments
  • Data collection and analysis
  • Example research question
  • Simulation
  • Computer experiments
  • Results
one approach to researching f ossd
One Approach to Researching F/OSSD
  • Online data
    • Screen scraping
    • Database dumps
  • Modeling
    • Social network theory
    • Evolutionary assumptions
  • Simulation
    • Verification and validation
    • Computer experiments
  • Variation of Classical Scientific Method
classical scientific method
Classical Scientific Method
  • Observe the world
    • Identify a puzzling phenomenon
  • Generate a falsifiable hypothesis (K. Popper)
  • Design and conduct an experiment with the goal of disproving the hypothesis
    • If the experiment “fails”, then the hypothesis is accepted (until replaced)
    • If the experiment “succeeds”, then reject hypothesis, but additional insight into the phenomenon may be obtained and steps 2-3 repeated
agent based simulation as a component of the scientific method
Agent-Based Simulation as a Component of the Scientific Method

Modeling

(Hypothesis)

Agent -Based

Simulation

(Experiment)

Observation

agent based simulation as a component of the scientific method8
Agent-Based Simulation as a Component of the Scientific Method

Modeling

(Hypothesis)

Social Network

Model of F/OSS

Agent -Based

Simulation

(Experiment)

Observation

Analysis of

Grow Artificial

SourceForge

SourceForge

Data

agent based modeling and simulation
Agent-Based Modeling and Simulation
  • Conceptual models of a phenomenon
  • Simulations are computer implementations of the conceptual models
  • Agents in models and simulations are distinct entities (instantiated objects)
    • Tend to be simple, but with large numbers of them (thousands, or more) - i.e., swarm intelligence
    • Contrasted with higher level AI “intelligent agents”
  • Foundations in complexity theory
    • Self-organization
    • Emergence
collaborative social networks
Collaborative Social Networks
  • Research-paper co-authorship, small world phenomenon, e.g., Erdos number (Barabasi 2001, Newman 2001)
  • Movie actors, small world phenomenon, e.g., Kevin Bacon number (Watts 1999, 2003)
  • Interlocking corporate directorships
  • Terrorist Networks
  • Open-source software developers (Madey et al, AMCIS 2002)
  • Collaborators are nodes in a graph, and collaborative relationship are the edges of the graph => a framework to model data/phenomenon
sourceforge
SourceForge
  • VA Software
  • Part of OSDN
  • Started 12/1999
  • Collaboration tools
  • 70,000 Projects
  • 90,000 Developers
  • 700,00 Registered Users
savannah
Savannah
  • SourceForge Software?
  • Free Software Foundation
  • 1,600 Projects
  • 16,000 Registered Users
observations
Observations
  • Web mining
  • Web crawler (scripts)
    • Python
    • Perl
    • AWK
    • Sed
  • Monthly
  • Since Jan 2001
  • ProjectID
  • DeveloperID
  • Almost 2 million records
  • Relational database

PROJ|DEVELOPER

8001|dev378

8001|dev8975

8001|dev9972

8002|dev27650

8005|dev31351

8006|dev12509

8007|dev19395

8007|dev4622

8007|dev35611

8008|dev8975

slide14

Collaboration Networks

Adapted from Newman, Strogatz and Watts, 2001

slide15

dev[72]

dev[67]

dev[52]

dev[65]

dev[70]

dev[57]

7597 dev[46]

6882 dev[47]

dev[64]

dev[45]

dev[99]

7597 dev[46]

7597 dev[46]

dev[52]

dev[72]

dev[67]

7597 dev[46]

6882 dev[47]

dev[47]

dev[55]

dev[55]

dev[55]

7597 dev[46]

7028 dev[46]

dev[70]

7597 dev[46]

7028 dev[46]

dev[57]

dev[45]

dev[99]

dev[51]

7597 dev[46]

7028 dev[46]

6882 dev[47]

6882 dev[58]

dev[61]

dev[51]

dev[79]

dev[47]

7597 dev[46]

dev[58]

dev[58]

dev[46]

9859 dev[46]

dev[54]

15850 dev[46]

dev[58]

9859 dev[46]

dev[79]

dev[58]

9859 dev[46]

dev[49]

dev[53]

9859 dev[46]

15850 dev[46]

dev[59]

dev[56]

15850 dev[46]

dev[83]

15850 dev[46]

dev[48]

dev[53]

dev[56]

dev[83]

dev[48]

F/OSS Developers - Collaboration Social Network

Developers are nodes / Projects are links

24 Developers

5 Projects

Project 7597

2 Linchpin Developers

1 Cluster

dev[64]

Project 6882

Project 7028

dev[61]

dev[54]

dev[49]

dev[59]

Project 9859

Project 15850

topological analysis of the data
Topological Analysis of the Data
  • Statistics inspected
    • Diameter
    • Average degree
    • Clustering coefficient
    • Degree distribution
    • Cluster size distribution
    • Relative size of major cluster
    • Fitness and life cycle
  • Evolution of these statistics
  • Dual networks
    • developer network and project network
terminology
Terminology
  • Diameter
    • Average length of shortest paths between all pairs of vertices
  • Degree
    • The count of edges connected to given vertex
  • Average degree
    • Average of the degrees of all vertices in the network
  • Cluster
    • The connected components of the network
  • Clustering coefficient (CC)
    • CCi: Fraction representing the number of links actually present relative to the total possible number of links among the vertices in its neighborhood.
    • CC: average of all CCi in a network
  • Degree distribution
    • The distribution of degrees throughout a network
  • Major cluster
    • The largest cluster in the network
diameter of developer network vs time
Diameter of Developer Network vs. Time
  • Network size increased from 30,000 to 70,000
diameter of project network vs time
Diameter of Project Network vs. Time
  • Network size increased from 20,000 to 50,000.
  • Diameter decreasing with time both for developer network and project network
cluster size distribution
Cluster Size Distribution
  • R2 with major cluster is 0.7426
  • R2 without major cluster is 0.9799
relative size of major cluster vs time
Relative Size of Major Cluster vs. Time
  • Increase of the relative size of the major cluster
  • Approaching steady-state?
an example research question
An Example Research Question
  • What processes can explain the evolution of the project and developer social networks?
    • Randomly growing network (Erdos-Reyni, 1960)?
    • Evolving network with preferential attachment (Barabasi-Albert, 1999)?
    • Evolving network with preferential attachment and fitness (Barabasi-Albert, 2001)?
    • Others?
computer experiments
Computer Experiments
  • Agent-based simulations
  • Java programs using Swarm class library
    • Validation (docking) exercises using Java/Repast
  • Grow artificial SourceForge’s (Epstein & Axtell, 1996)
    • Parameterized with observed data, e.g., developer behaviors
      • Join rates
      • New project additions
      • Leave projects
    • Evaluation of multiple models (hypotheses)
    • Verification/validation
cycles of modeling simulation
Cycles of Modeling & Simulation

Modeling

(Hypothesis)

Social Network Models

ER => BA => BA+Fitness => BA+Dynamic Fitness

Agent -Based

Simulation

(Experiment)

Observation

Degree Distribution

Average Degree

Diameter

Clustering Coefficient

Cluster Size Distribution

Analysis of

Grow Artificial

SourceForge

SourceForge

Data

model for sourceforge
Model for SourceForge
  • ABM based on bipartite graph
  • Model description
    • Agent: developer
    • Behaviors: Create, join, abandon and idle
    • Preference: developer’s and project’s
    • Fitness
  • Four models in iterations
    • ER, BA, BA with constant fitness and BA with dynamic fitness
  • Comparison of empirical and simulated data
er model degree distribution
ER Model – Degree Distribution
  • Degree distribution is normal distribution while it is power law in empirical data
  • Fit Fails!
er model diameter
ER Model - Diameter
  • Average degree is decreasing while it is increasing in empirical data
  • Diameter is increasing while it is decreasing in empirical data
  • Fit Fails!
er model clustering coefficient
ER Model – Clustering Coefficient
  • Clustering coefficient is relatively low under 0.3 while it is around 0.7 in empirical data.
  • Fit fails!
er model cluster size distribution
ER Model – Cluster Size Distribution
  • Power law distribution with R2 as 0.6667 (0.9653 without the major cluster) while R2 in empirical data is 0.7426 (0.9799 without the major cluster)
  • The actual distribution is different from empirical data
  • Fit Fails!
ba model degree distribution
BA Model – Degree Distribution
  • Power laws in degree distributions, similar to empirical data (o for simulated data and x for empirical data).
  • For developer distribution: simulated data has R2 as 0.9798 and empirical data has R2 as 0.9714.
  • For project distribution: simulated data has R2 as 0.6650 and empirical data has R2 as 0.9838.
  • Partial Fit!
ba model diameter and clustering coefficient
BA Model – Diameter and Clustering Coefficient
  • Small diameter and high clustering coefficient like empirical data
  • Diameter and clustering coefficient are both decreasing like empirical data
  • Good Fit!
ba model with constant fitness
BA Model with Constant Fitness
  • Power laws in degree distributions, similar to empirical data (o for simulated data and x for empirical data).
  • For developer distribution: simulated data has R2 as 0.9742 and empirical data has R2 as 0.9714.
  • For project distribution: simulated data has R2 as 0.7253 and empirical data has R2 as 0.9838.
  • Improved fit!
ba model with dynamic fitness
BA Model with Dynamic Fitness
  • Power laws in degree distribution, similar to empirical data (o for simulated data and x for empirical data).
  • For developer distribution: simulated data has R2 as 0.9695 and empirical data has R2 as 0.9714.
  • For project distribution: simulated data has R2 as 0.8051 and empirical data has R2 as 0.9838.
  • Somewhat better fit!
models of the f oss social network alternative hypotheses
Models of the F/OSS Social Network(Alternative Hypotheses)
  • General model features
    • Agents are nodes on a graph (developers or projects)
    • Behaviors: Create, join, abandon and idle
    • Edges are relationships (joint project participation)
    • Growth of network: random or types of preferential attachment, formation of clusters
    • Fitness
    • Network attributes: diameter, average degree, degree distribution, clustering coefficient
  • Four specific models
    • ER (random graph) - (1960)
    • BA (preferential attachment) - (1999)
    • BA ( + constant fitness) - (2001)
    • BA ( + dynamic fitness) - (2003)
summary41
Summary
  • Why Agent-Based Modeling and Simulation?
    • Can be used as components of the Scientific Method
    • A research approach for studying socio-technical systems
  • Case study: F/OSS - Collaboration Social Networks
    • SourceForge conceptual models: ER, BA, BA with constant fitness and BA with dynamic fitness.
    • Simulations
      • Computer experiments that tested conceptual models
      • Provided insight into the phenomenon under study and guided data mining of collected observations
questions
Questions
  • Validity of approaches
    • Social networks
    • Simulation
  • Value/Utility of approachs
  • Applicability to other areas of F/OSS research
    • Project sites, e.g., Mozilla.org
    • Individual projects, e.g., Linux kernel