
Mining Email Social Networks in OSS




Presentation Transcript


    1. Mining Email Social Networks in OSS
       Christian Bird, Prem Devanbu, Alex Gourley, and Michael Gertz (Department of Computer Science); Anand Swaminathan (Graduate School of Management). University of California, Davis.
       Speaker notes: Get names right. Me, project, joint work with.

    2. Motivation
       The social process is an important, hard-to-study aspect of any software engineering effort.
       It can be studied in many stable and mature OSS projects.
       Nearly all communication is done via the internet.
       Records of both communication and development activity are freely available.
       Speaker notes: Mention that incorporation of newcomers is important and needs to be understood. Mention that the social process in traditional projects is hard to study. Records of both development activity and communication are archived and freely available for most OSS projects. (After first bullet) Incorporation of newcomers is vital to the success of an OSS project, so understanding it is valuable. Hard to study in traditional projects. Not ALL communication is available; a large amount is. We want to study the social process on the Apache mailing list.

    3. Apache Communication and Development (since 1996)
       100,000+ messages on the dev mailing list
       70,000 CVS commits to files
       Speaker notes: Next transition: how do we make sense of this data? We hope to quantitatively evaluate some common beliefs about the social structure of OSS projects. Enlarge labels and add years. Correlate with major releases. MEMORIZE: Our goal is to use this data to quantitatively evaluate existing hypotheses regarding the social structure of OSS projects.

    4. It is widely believed that OSS communities form a hierarchy
       Speaker notes: Either on the slide or in your talk, mention that this view is qualitative and would benefit from a quantitative analysis. Look for documenter mailing list. Use SNA to put this diagram on a more formal, quantitative basis.

    5. Social Networks
       A network consisting of actors and their social ties to each other.
       Speaker notes: Just say nodes are people and ties are dating relationships. Some people are more connected and central than others. Transition: this same formalism of social networks has been used in analyzing OSS projects before.

    6. Related Work
       Xu, Gao, Christley, and Madey looked at developers who worked on the same projects.
       Crowston & Howison used co-occurrence of developers on a bug report as a social link.
       Lopez, Gonzalez-Barahona, & Robles created networks of developers and modules via CVS data.
       We believe that responses to emails indicate a strong social link.
       Speaker notes: Unfortunately, there are some hoops to jump through… "Robe-layz"; get the names right. Mention that we get a much larger network because we don't include just devs.

    7. Issues with Mailing List Analysis
       Extracting conversation threads
       Rationalizing timestamps
       Identifying targets in a broadcast medium
       Resolving email aliases
       Extracting content
       Speaker notes: Need to recreate message threads by looking at replies. Need to deal with different time zones and remove messages where the clock wasn't set properly. Talk about extracting/analyzing textual content of messages (it's hard) before aliasing (don't mention patches).
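
The timestamp issue on this slide can be illustrated with a minimal sketch. It assumes each message carries a standard `Date` header, converts it to UTC, and drops dates that fall outside a plausible archive window. The function name `normalize_date` and the `ARCHIVE_END` bound are hypothetical, not taken from the talk.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

# The talk says the archive starts in 1996; the end bound is a hypothetical choice.
ARCHIVE_START = datetime(1996, 1, 1, tzinfo=timezone.utc)
ARCHIVE_END = datetime(2006, 12, 31, tzinfo=timezone.utc)


def normalize_date(date_header, earliest, latest):
    """Return the message time in UTC, or None if the header is unusable."""
    try:
        parsed = parsedate_to_datetime(date_header)
    except (TypeError, ValueError):
        # Unparseable Date header (the exception type depends on the Python version).
        return None
    if parsed.tzinfo is None:
        # No usable zone information: treat the time as UTC (an assumption).
        parsed = parsed.replace(tzinfo=timezone.utc)
    parsed = parsed.astimezone(timezone.utc)
    # A sender clock that was never set tends to produce dates far outside the archive.
    if not (earliest <= parsed <= latest):
        return None
    return parsed
```

Messages for which this returns `None` would be excluded before any time-based analysis.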

    8. Email Aliases
       2,544 different email address aliases have been used on the Apache dev mailing list since 1996.
       Many of these email addresses belong to the same people.
       The following email addresses were all used by Joe Orton.
       Speaker notes: Many of the most active developers use the most aliases. Don't spend too much time on the example. We just want to exploit the similarity of the emails.

    9. Email Alias Analysis
       1. Preprocess name and address.
          Remove commas (“orton, joe” -> “joe orton”).
          Normalize whitespace and remove punctuation and common prefixes/suffixes (Mr., Jr., etc.).
          Remove common email terms (list, admin, root).
       2. Use heuristics and fuzzy matching (Levenshtein edit distance) to determine which email aliases are similar.
          name-name: “joe orton” vs. “joe e. orton”
          email-email: “jorton@foo.com” vs. “jorton@bar.org”
          name-email: “joe orton” vs. “jorton@foo.com”
       3. Manually post-process aliases marked as similar to remove the high level of false positives.
       4. Use a similar process to map CVS accounts to email aliases.
       Speaker notes: This is not an algorithm. Preprocess by splitting around commas, removing whitespace and punctuation. No need to explain edit distance. We use edit distance and heuristics such as “chris bird” is “cbird” and “chrisb”. We use this to build clusters and manually post-process the clusters. How many singletons were there? Transition to talk about results.
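
The preprocessing and matching steps on this slide might look roughly like the sketch below. It is not the authors' tool: the word lists, the 0.25 similarity threshold, and all function names are assumptions chosen for illustration, and the output would still need the manual post-processing the slide describes.

```python
import re

# Hypothetical word lists; the slide only names a few examples of each.
PREFIXES_SUFFIXES = {"mr", "mrs", "ms", "dr", "jr", "sr"}
EMAIL_TERMS = {"list", "admin", "root"}


def normalize_name(name):
    """Lowercase, reorder "last, first", and strip punctuation and noise words."""
    if "," in name:
        last, first = name.split(",", 1)
        name = first + " " + last
    name = re.sub(r"[^\w\s]", " ", name.lower())
    words = [w for w in name.split() if w not in PREFIXES_SUFFIXES | EMAIL_TERMS]
    return " ".join(words)


def levenshtein(a, b):
    """Classic dynamic-programming edit distance, one row at a time."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similar(a, b, threshold=0.25):
    """True if the edit distance is small relative to the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return False
    return levenshtein(a, b) / longest <= threshold


def name_email_candidates(name):
    """Likely mailbox names for a person, e.g. "chris bird" -> cbird, chrisb."""
    words = name.split()
    if len(words) < 2:
        return set()
    first, last = words[0], words[-1]
    return {first[0] + last, first + last[0], first + last}


def may_be_same(name_a, email_a, name_b, email_b):
    """Flag two aliases as candidates for the same identity."""
    na, nb = normalize_name(name_a), normalize_name(name_b)
    la = email_a.split("@")[0].lower()
    lb = email_b.split("@")[0].lower()
    return (similar(na, nb)                        # name-name
            or similar(la, lb)                     # email-email
            or lb in name_email_candidates(na)     # name-email
            or la in name_email_candidates(nb))
```

Pairs flagged by `may_be_same` are only candidates; a loose threshold produces many false positives, which is why a manual pass follows.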

    10. Alias Results
        2,544 email aliases used
        2,008 unique “identities” used
        Many of the high-volume participants had a large number of aliases.

    11. Creating the Email Social Network
        Each email message has a message ID.
        A response message contains an “In-Reply-To” header, which includes the message ID of the previous message.
        If Joe posts a message and Bob responds, then there is an indication of information flow, and we create a directed tie from Joe to Bob.
        We have built a tool that creates a directed, valued adjacency matrix of participants from our mailing list database for any time period.
        Speaker notes: Show Bob-Alice animation example here and talk through it. Should go faster. Message IDs match, so there's a social network link.
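
A minimal sketch of the construction described on this slide, assuming messages are available as dictionaries with hypothetical keys `id`, `sender`, and `in_reply_to`, and that `identity_of` maps an address to a resolved identity. Ignoring self-replies is an assumption, not something the talk states.

```python
from collections import defaultdict


def build_network(messages, identity_of):
    """Count replies per (original poster, responder) pair.

    messages: list of dicts with 'id', 'sender', 'in_reply_to' (hypothetical schema).
    identity_of: function mapping an email address to a resolved identity.
    """
    sender_by_id = {m["id"]: identity_of(m["sender"]) for m in messages}
    weights = defaultdict(int)
    for m in messages:
        parent = m.get("in_reply_to")
        if parent not in sender_by_id:
            continue  # not a reply, or the parent message is not in the archive
        poster = sender_by_id[parent]
        responder = identity_of(m["sender"])
        if poster != responder:  # assumption: replies to oneself create no tie
            weights[(poster, responder)] += 1  # directed tie: poster -> responder
    return dict(weights)


def to_matrix(weights):
    """Turn the tie counts into a directed, valued adjacency matrix."""
    actors = sorted({actor for pair in weights for actor in pair})
    index = {actor: i for i, actor in enumerate(actors)}
    matrix = [[0] * len(actors) for _ in actors]
    for (src, dst), weight in weights.items():
        matrix[index[src]][index[dst]] = weight
    return actors, matrix
```

Restricting `messages` to a date range before calling `build_network` would give the network for a chosen time period.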

    12. Intro to Social Network Metrics
        In-degree: the number of links whose head is connected to a particular actor
        Out-degree: the number of links whose tail is connected to a particular actor
        Geodesic: a shortest path between two actors
        Betweenness: the number of geodesics that a particular actor lies on
        Speaker notes: In this slide, explain what betweenness means and why it's important in an SNA context.
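
The first three metrics can be sketched directly from a set of directed ties. The example below assumes ties are given as `(tail, head)` pairs and ignores tie values; the function names are illustrative only.

```python
from collections import Counter, defaultdict, deque


def degrees(edges):
    """Return (in_degree, out_degree) counters for directed (tail, head) pairs."""
    in_degree, out_degree = Counter(), Counter()
    for tail, head in set(edges):
        out_degree[tail] += 1  # link leaves the tail actor
        in_degree[head] += 1   # link arrives at the head actor
    return in_degree, out_degree


def geodesic_length(edges, source, target):
    """Length of a shortest directed path, or None if target is unreachable."""
    neighbours = defaultdict(list)
    for tail, head in set(edges):
        neighbours[tail].append(head)
    distance = {source: 0}
    queue = deque([source])
    while queue:  # breadth-first search visits actors in order of distance
        actor = queue.popleft()
        if actor == target:
            return distance[actor]
        for nxt in neighbours[actor]:
            if nxt not in distance:
                distance[nxt] = distance[actor] + 1
                queue.append(nxt)
    return None
```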

    13. [Slide without text]

    14. Betweenness more formally
        Speaker notes: Transition: to put this idea of betweenness in context on the Apache mailing list, it's useful to look at a picture of it.
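
The slide's own formula is not preserved in this transcript. As a stand-in, the sketch below computes the commonly used definition of betweenness: for each ordered pair of other actors, the fraction of geodesics between them that pass through the actor, summed over all pairs. It uses Brandes' accumulation scheme on an unweighted directed graph; whether the talk used exactly this variant is an assumption.

```python
from collections import defaultdict, deque


def betweenness(edges):
    """Unnormalized betweenness for directed, unweighted (tail, head) pairs."""
    neighbours = defaultdict(list)
    nodes = set()
    for tail, head in set(edges):
        neighbours[tail].append(head)
        nodes.update((tail, head))
    score = dict.fromkeys(nodes, 0.0)
    for source in nodes:
        # Breadth-first search, counting shortest paths (sigma) to every actor.
        order = []
        predecessors = {v: [] for v in nodes}
        sigma = dict.fromkeys(nodes, 0)
        sigma[source] = 1
        distance = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in neighbours[v]:
                if w not in distance:
                    distance[w] = distance[v] + 1
                    queue.append(w)
                if distance[w] == distance[v] + 1:
                    sigma[w] += sigma[v]
                    predecessors[w].append(v)
        # Walk back from the farthest actors, accumulating path dependencies.
        delta = dict.fromkeys(nodes, 0.0)
        while order:
            w = order.pop()
            for v in predecessors[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != source:
                score[w] += delta[w]
    return score
```

In a chain joe -> bob -> alice, only bob lies on a geodesic between two other actors, so only bob receives a non-zero score.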

    15. Speaker notes: Don't say all info flows through Ryan Bloom. Past isn't a complete predictor of the future. Now, the complete social network is too big to show, but it's useful to look at some distribution data of the graphs.

    16. The distributions of in-degree and out-degree both exhibit a power-law character
        Speaker notes: What we have extracted is typical of a social network. Now we turn to the question: is there any difference between developers and non-developers in this social network? Enlarge labels and state clearly that it's log-log.
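
One rough way to check for power-law character is to fit a straight line to the degree frequencies on log-log axes, as sketched below. This is an illustrative check under the assumption that degrees are given as a plain list of integers; a straight line on a log-log plot is only suggestive, and the talk does not say how the distributions were assessed.

```python
import math
from collections import Counter


def loglog_slope(degree_values):
    """Least-squares slope of log(frequency) against log(degree)."""
    counts = Counter(d for d in degree_values if d > 0)  # log(0) is undefined
    if len(counts) < 2:
        raise ValueError("need at least two distinct positive degrees")
    xs = [math.log(degree) for degree in counts]
    ys = [math.log(frequency) for frequency in counts.values()]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx
```

A clearly negative, roughly constant slope across the range of degrees is what a power-law-like distribution would produce.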

    17. Status of Developers vs. Non-Developers
        Speaker notes: Note that the largest discrepancy between devs and non-devs is found in the betweenness metric. This indicates that developers are “gate-keepers” or information brokers in the email network. In-degree and out-degree are local measures, whereas betweenness is a more global metric. Transition: now, it's not just that developers are different from non-developers; there's actually a strong relationship between social network status and development activity.

    18. Correlation between communication and development
        Speaker notes: Drop the last three columns. Circle the correlations of interest. Divide SN metrics and dev metrics. Circle the relevant correlations. Next we can see that developers and non-developers can be distinguished by their degrees, right from the time they first appear on the email list. Add arrows for the first and second bullets.
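
The transcript does not preserve the correlation table or say which coefficient was used. As an illustration only, the sketch below computes a Spearman rank correlation between a social network metric and a development metric, one value per person; the choice of Spearman and all names are assumptions.

```python
import math


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))
```

For example, `spearman(betweenness_per_dev, commits_per_dev)` would compare two hypothetical per-developer lists in the same order.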

    19. Observations from the network
        The mailing list activity reflects a typical social network.
        Developers are the “key social brokers”.
        More active developers tend to be more important.
        Results are robust: Postgres showed similar results.
        Speaker notes: Active development -> important in social network.

    20. Topics of future research
        Visualization of software and social data
        Who becomes a developer?
        Relationship between communication and collaboration networks
        Network evolution
        Conway's Law
        Speaker notes: Who becomes a developer? What variables affect who becomes a developer and who doesn't (number of patches submitted, emails to core devs, betweenness, length of time on the mailing list, etc.)? Who becomes a developer: we have new, recent data that we'd be happy to share; ask me or my advisor. Relationship between communication and collaboration networks: are defects more likely to occur if two people collaborate but don't communicate? Network evolution: how do the networks change over time? What events cause or precede these changes? Conway's Law.

    21. Average In-Degree
        Speaker notes: Throw all pictures into the same slide.
