1. Mining Email Social Networks in OSS
Christian Bird, Prem Devanbu, Alex Gourley, and Michael Gertz
Department of Computer Science
Anand Swaminathan
Graduate School of Management
University of California, Davis
Get names right
Me – project – joint work with
2. Motivation
The social process is an important but hard-to-study aspect of any software engineering effort
Can be studied in many stable and mature OSS projects
Nearly all communication is done via the internet
Records of both communication and development activity are freely available
Mention that incorporation of newcomers is important and needs to be understood
Mention that social process in traditional projects is hard to study.
Records of both development activity and communication are archived and freely available for most OSS projects
(after first bullet) incorporation of newcomers is vital to the success of an OSS project, so understanding it is valuable
Hard to study in traditional projects.
Not ALL communication is available. A large amount is.
We want to study the social process on the Apache mailing list
3. Apache Communication and Development (since 1996)
100,000+ messages on dev mailing list
70,000 CVS commits to files
Next transition: How do we make sense of this data? We hope to quantitatively evaluate some common beliefs about the social structure of OSS projects.
Enlarge labels and add years
Correlate with major releases
MEMORIZE: Our goal is to use this data to quantitatively evaluate existing hypotheses regarding the social structure of OSS projects
4. It is widely believed that OSS communities form a hierarchy
Either on the slide or in your talk, mention that this view is qualitative and would benefit from a quantitative analysis. Look for documenter mailing list.
Use SNA to put this diagram on a more formal, quantitative basis.
5. Social Networks
A network consisting of actors and their social ties to each other.
Just say nodes are people and ties are dating relationships
Some people are more connected and central than others
Transition: this same formalism of social networks has been used in analyzing OSS projects before
6. Related Work
Xu, Gao, Christley, and Madey looked at developers who worked on the same projects
Crowston & Howison used co-occurrence of developers on a bug report as a social link
Lopez, Gonzalez-Barahona, & Robles created networks of developers and modules via CVS data.
We believe that a response to an email indicates a strong social link.
Unfortunately, there are some hoops to jump through…
Robe-layz, get the names right.
Mention that we get a much larger network because we don’t include just devs.
7. Issues with Mailing List Analysis
Extracting conversation threads
Rationalizing Timestamps
Identifying targets in a broadcast medium
Resolving Email Aliases
Extracting Content
Need to recreate message threads by looking at replies
Need to deal with different time zones and remove messages where the clock wasn’t set properly
Talk about extracting/analyzing textual content of messages (it’s hard) before aliasing (don’t mention patches)
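A minimal sketch of the timestamp step, assuming each message exposes its raw Date header as a string. The function name `normalize_date`, the `EARLIEST` bound, and the rule for rejecting bad clocks are illustrative assumptions, not the tool used in this work.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

# Assumed lower bound: the archive starts in 1996, so anything earlier
# is treated as a badly set clock.
EARLIEST = datetime(1996, 1, 1, tzinfo=timezone.utc)

def normalize_date(date_header, latest):
    """Return the send time in UTC, or None if the header is unusable."""
    try:
        sent = parsedate_to_datetime(date_header)
    except (TypeError, ValueError):
        return None  # unparseable Date header
    if sent.tzinfo is None:
        # Header carried no usable zone; assume UTC for this sketch.
        sent = sent.replace(tzinfo=timezone.utc)
    sent = sent.astimezone(timezone.utc)
    if sent < EARLIEST or sent > latest:
        return None  # outside the archive window: clock was probably wrong
    return sent
```

Here `latest` would be the time the archive was collected; messages that come back as None are dropped before any thread or network analysis.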
8. Email Aliases
2,544 different email address aliases have been used on the Apache dev mailing list since 1996.
Many of these email addresses belong to the same people.
The following email addresses were all used by Joe Orton.
Many of the most active developers use the most aliases
Don’t spend too much time on the example.
We just want to exploit the similarity of the emails
9. Email Alias Analysis
1. Preprocess name and address.
Remove commas (“orton, joe” -> “joe orton”)
Normalize whitespace and remove punctuation and common prefixes/suffixes (Mr., jr., etc.)
Remove common email terms (list, admin, root)
2. Use heuristics and fuzzy matching (Levenshtein edit distance) to determine which email aliases are similar.
name-name: “joe orton” vs. “joe e. orton”
email-email: “jorton@foo.com” vs. “jorton@bar.org”
name-email: “joe orton” vs. “jorton@foo.com”
3. Manually postprocess aliases marked as similar to remove the high level of false positives
4. Use a similar process to map CVS accounts to email aliases
This is not an algorithm
Preprocess by splitting around commas, removing whitespace and punctuation
No need to explain edit distance
We use edit distance and heuristics such as chris bird is cbird and chrisb
We use this to build clusters and manually postprocess the clusters
How many singletons were there?
Transition to talk about results
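The preprocessing and matching steps above can be sketched roughly as follows. This is an illustration under assumptions, not the actual heuristics used: the `NOISE` word list, the 0.8 similarity threshold, and the set of name-to-address patterns are all made up for the example, and candidate pairs would still need the manual review described in step 3.

```python
import re

# Assumed noise words: titles, suffixes, and generic email terms.
NOISE = {"mr", "mrs", "dr", "jr", "sr", "list", "admin", "root"}

def normalize_name(raw):
    """Lower-case, flip 'last, first', drop punctuation and noise words."""
    raw = raw.lower()
    if "," in raw:
        last, first = raw.split(",", 1)
        raw = first + " " + last
    words = re.sub(r"[^a-z0-9\s]", " ", raw).split()
    return " ".join(w for w in words if w not in NOISE)

def levenshtein(a, b):
    """Edit distance by dynamic programming, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete from a
                           cur[j - 1] + 1,             # insert into a
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]

def similar(a, b, threshold=0.8):
    """True if the normalized edit-distance similarity reaches threshold."""
    longest = max(len(a), len(b)) or 1
    return 1 - levenshtein(a, b) / longest >= threshold

def name_matches_email(name, address):
    """Heuristic: 'chris bird' matches local parts like cbird or chrisb."""
    parts = name.split()
    if len(parts) < 2:
        return False
    first, last = parts[0], parts[-1]
    local = re.sub(r"[^a-z]", "", address.split("@", 1)[0].lower())
    return local in {first[0] + last, first + last[0],
                     first + last, last + first[0]}

print(normalize_name("Orton, Joe"))                       # joe orton
print(similar("joe orton", normalize_name("joe e. orton")))  # True
print(name_matches_email("joe orton", "jorton@foo.com"))  # True
```

For the email-email case, the same `similar` check could be applied to the local parts of two addresses. Pairs flagged by any of the three checks would then be clustered and reviewed by hand.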
10. Alias Results
2,544 email aliases used
2,008 unique “identities” used
Many of the high-volume participants had a large number of aliases
11. Creating the Email Social Network
Each email message has a message ID.
A response message contains an “In-Reply-To” header, which includes the message ID of the previous message.
If Joe posts a message and Bob responds, then there is an indication of information flow and we create a directed tie from Joe to Bob.
We have built a tool that will create a directed, valued adjacency matrix of participants from our mailing list database for any time period.
Show bob-alice animation example here and talk through it. Should go faster
Message IDs match, so there’s a social network link.
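A rough sketch of how such a directed, valued network could be built from reply headers (the standard email header for this is `In-Reply-To`). The message layout (dicts with `id`, `sender`, and `in_reply_to` keys), the `identity_of` alias map, and the function name are assumptions for illustration; the actual tool and database schema are not shown in the talk. Filtering by time period is omitted.

```python
from collections import defaultdict

def build_network(messages, identity_of):
    """Count replies between resolved identities.

    messages:    list of dicts with 'id', 'sender', 'in_reply_to' keys
    identity_of: dict mapping an email alias to a resolved identity
    Returns weights[poster][responder] = number of replies.
    """
    # Map each message ID to the identity that wrote it.
    author = {m["id"]: identity_of.get(m["sender"], m["sender"])
              for m in messages}
    weights = defaultdict(lambda: defaultdict(int))
    for m in messages:
        parent = m.get("in_reply_to")
        if parent not in author:
            continue  # thread root, or parent missing from the archive
        poster, responder = author[parent], author[m["id"]]
        if poster != responder:  # ignore people replying to themselves
            weights[poster][responder] += 1
    return weights
```

The nested dictionary is a sparse form of the adjacency matrix: the tie runs from the original poster to the responder, and the value is the number of replies.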
12. Intro to Social Network Metrics
In-degree – The number of links whose head is connected to a particular actor
Out-degree – The number of links whose tail is connected to a particular actor
Geodesic – A shortest path between two actors
Betweenness – The number of geodesics that a particular actor lies on.
In this slide, explain what betweenness means and why it’s important in an SNA context.
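As an illustration of these metrics, the sketch below computes in-degree, out-degree, and betweenness on an unweighted directed graph given as a set of (tail, head) pairs. Betweenness uses Brandes' algorithm, which is a standard choice assumed here and not necessarily what was used for the talk. It also uses the common fractional form: when several geodesics connect a pair, each contributes a share rather than a full count.

```python
from collections import deque

def degrees(edges):
    """edges: set of (tail, head) pairs. Returns (in_degree, out_degree)."""
    nodes = {n for e in edges for n in e}
    indeg = {n: 0 for n in nodes}
    outdeg = {n: 0 for n in nodes}
    for tail, head in edges:
        outdeg[tail] += 1
        indeg[head] += 1
    return indeg, outdeg

def betweenness(edges):
    """Brandes' algorithm for an unweighted directed graph."""
    nodes = {n for e in edges for n in e}
    succ = {n: [] for n in nodes}
    for tail, head in edges:
        succ[tail].append(head)
    score = {n: 0.0 for n in nodes}
    for s in nodes:
        # Breadth-first search from s, counting shortest paths (sigma).
        dist = {s: 0}
        sigma = {n: 0 for n in nodes}
        sigma[s] = 1
        preds = {n: [] for n in nodes}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in succ[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, accumulating dependencies.
        delta = {n: 0.0 for n in nodes}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    return score
```

On the chain a -> b -> c, actor b lies on the only geodesic from a to c and gets a betweenness of 1, while a and c get 0.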
13.
14. Betweenness more formally
Transition: to put this idea of betweenness in context on the Apache mailing list, it’s useful to look at a picture of it.
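The formula from this slide did not survive in the transcript. For reference, the standard definition of betweenness from the social network literature is given below; treating it as what the slide showed is an assumption. Here \sigma_{st} is the number of geodesics from actor s to actor t, and \sigma_{st}(v) is the number of those that pass through actor v.

```latex
C_B(v) = \sum_{s \neq v \neq t} \frac{\sigma_{st}(v)}{\sigma_{st}}
```

When every pair of actors is joined by a single geodesic, this reduces to the plain count of geodesics that v lies on.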
15.
Don’t say all info flows through Ryan Bloom. Past isn’t a complete predictor of future.
Now, the complete social network is too big to show, but it’s useful to look at some distribution data of the graphs.
16. The distributions of in-degree and out-degree both exhibit a power-law character
What we have extracted is typical of a social network
Now, we turn to the question: is there any difference between developers and non-developers in this social network?
Enlarge labels and state clearly that it’s log-log
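One rough way to check for power-law character is to fit a straight line to the degree histogram on log-log axes; a roughly linear plot with negative slope is what the slide's figure would show. The sketch below is an assumed illustration (the function name and the least-squares approach are not from the talk), and a log-log fit is only a quick visual check, not a rigorous test.

```python
import math
from collections import Counter

def loglog_slope(degrees):
    """Least-squares slope of log(count) against log(degree)."""
    # Zero degrees are skipped because log(0) is undefined.
    counts = Counter(d for d in degrees if d > 0)
    xs = [math.log(d) for d in counts]
    ys = [math.log(c) for c in counts.values()]
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two distinct positive degrees")
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx
```

If the number of actors with degree d falls off as d to the power -2, the returned slope is close to -2.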
17. Status of Developers vs. Non-Developers
Note that the largest discrepancy between devs and non-devs is found in the betweenness metric. This indicates that developers are “gate-keepers” or information brokers in the email network. In-degree and out-degree are local measures, whereas betweenness is a more global metric.
Transition: now it’s not just that developers are different from non-developers; there’s actually a strong relationship between social network status and development activity.
18. Correlation between communication and development
Drop the last three columns. Circle the correlations of interest. Divide SN metrics and dev metrics.
Circle the relevant correlations
Next we can see that developers and non-developers can be distinguished by their degrees, right from the time they first appear on the email list.
Add arrows for the first and second bullets
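The correlation table itself is not in the transcript, and the talk does not say which coefficient was used. As one plausible illustration, the sketch below computes a Spearman rank correlation between a social network metric and a development metric for the same people. The choice of Spearman and all function names are assumptions.

```python
def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result

def pearson(xs, ys):
    """Pearson correlation; assumes neither sequence is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))
```

For example, `spearman(betweenness_per_dev, commits_per_dev)` would take two equal-length lists, one value per developer in the same order; both list names are hypothetical.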
19. Observations from the network
The mailing list activity reflects a typical social network.
Developers are the “key social brokers”.
More active developers tend to be more important.
Results robust: Postgres showed similar results.
Active development -> important in social network
20. Topics of future research
Visualization of software and social data
Who becomes a developer?
Relationship between communication and collaboration networks
Network Evolution
Conway’s Law
Who becomes a developer? What variables affect who becomes a developer and who doesn’t (number of patches submitted, emails to core devs, betweenness, length of time on mailing list, etc.)?
Who becomes a developer – we have new recent data that we’d be happy to share; ask me or my advisor
Relationship between communication and collaboration networks. Are defects more likely to occur if two people collaborate but don’t communicate?
Network Evolution – How do the networks change over time? What events cause or precede these changes?
Conway’s Law
21. Average In-Degree
Throw all pictures into the same slide.