This study provides insights into various aspects of open-source software development processes. The methodology involved selecting projects from the FreshMeat repository, coding their attributes, analyzing developers and subscribers, evaluating the evolution of project size, and drawing conclusions on productivity, developer distribution, and code variation across clusters. The findings challenge some common assumptions about open-source projects and emphasize the importance of empirical research in this field.
Characterizing the Open Source Software Process: a Horizontal Study A. Capiluppi, P. Lago, M. Morisio
Outline • Rationale behind the current study • Methodology • Conclusions • Current and future work
Rationale • Most Open Source analyses focus on a single, flagship project (Linux, Apache, GNOME) • Limitation: the conclusions are based on a ‘vertical’ study • there is a lack of ‘horizontal’ studies • a pool of projects • a wider area of interest
Methodology • Choice of projects • Attributes definition • Coding • Analysis
Choice of projects: repository • Selected the FreshMeat repository • FreshMeat (http://freshmeat.net) has focused on Open Source development since 1996 • It gathers thousands of projects, either also listed on SourceForge (http://sourceforge.net) or hosted on FreshMeat only • FreshMeat lists more than 24000 projects (many inactive)
Choice of projects: sampling I • From 24000 to 406 - how? • FreshMeat organizes projects by filters and categories • Filter = “Topic” • Categories = {“Internet”, “Database”, “Multimedia”,…} • Other filters: Programming language, Topic (i.e. application domain), Status of Evolution, etc.
Choice of projects: sampling II • We randomly picked a number of projects through the “Status” filter • Rationale: this filter has a limited number of associated categories {“Planning”, “PreAlpha”, “Alpha”, “Beta”, “Stable”, “Mature”} • The overall count is 406 projects
Attribute definition • Age • Application domain • Programming language • Size [KB] • Number of developers • Stable and transient developers • Number of users • Modularity level • Documentation level • Popularity • Status • Success of project • Vitality • Red: defined by FreshMeat • Black: defined by us
Coding • Each attribute was coded twice, to capture evolutionary trends • First observation: January 2002 • Second observation: July 2002
Analysis • Here we discuss: • Application domain issues • Developers [stable & transient] issues • Subscribers (as users) issues • Code size issues
Attributes: project’s developers • We evaluate how many people write code for an application • External contributions are always credited in special-purpose files, or in the ChangeLog • We distinguish between • Stable developers • Transient developers • Core team: more than one stable developer • Manual inspections and pattern-recognition scripts
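The slides do not show the pattern-recognition scripts, so the following is only a minimal sketch of what such a script might look like. It assumes GNU-style ChangeLog entry headers (date, name, e-mail in angle brackets); the function names and the `min_entries` threshold separating stable from transient developers are hypothetical, since the study's actual criterion is not stated here.

```python
import re
from collections import Counter

# Assumed GNU-style entry header, e.g. "2002-01-15  Jane Doe  <jane@example.org>"
ENTRY_HEADER = re.compile(r"^(\d{4}-\d{2}-\d{2})\s+(.+?)\s+<([^>]+)>\s*$")


def count_entries_per_author(changelog_text):
    """Count how many ChangeLog entries each author name heads."""
    counts = Counter()
    for line in changelog_text.splitlines():
        match = ENTRY_HEADER.match(line)
        if match:
            counts[match.group(2)] += 1
    return counts


def split_stable_transient(counts, min_entries=3):
    """Split authors into (stable, transient) lists.

    Hypothetical rule, not taken from the study: an author with at least
    `min_entries` entries counts as stable, everyone else as transient.
    """
    stable = sorted(a for a, n in counts.items() if n >= min_entries)
    transient = sorted(a for a, n in counts.items() if n < min_entries)
    return stable, transient
```

A script like this would still need the manual inspection mentioned above, because contributors credited in special-purpose files follow no common format.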
Developers over projects • We observe: • 72% of projects have a single stable developer • 80% of projects have at most 10 developers
Definition: clusters of developers • Cluster 1: 1 to 3 developers (64.5%) • Cluster 2: 4 to 10 developers (20%) • Cluster 3: 11 to 20 developers (9.5%) • “Average” nr. of stable dev: 2 • “Average” nr. of transient dev: 3 • Cluster 4: more than 20 developers (6%) • “Average” nr. of stable dev: 6 • “Average” nr. of transient dev: 19
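The cluster boundaries above can be expressed as a small classification function. This is a sketch only; the function names are illustrative, and the input is assumed to be the total number of developers (stable plus transient) per project.

```python
def developer_cluster(n_developers):
    """Map a project's developer count to its cluster number (1 to 4)."""
    if n_developers < 1:
        raise ValueError("a project needs at least one developer")
    if n_developers <= 3:
        return 1
    if n_developers <= 10:
        return 2
    if n_developers <= 20:
        return 3
    return 4


def cluster_shares(developer_counts):
    """Return the percentage of projects that falls into each cluster."""
    if not developer_counts:
        raise ValueError("no projects given")
    totals = {1: 0, 2: 0, 3: 0, 4: 0}
    for n in developer_counts:
        totals[developer_cluster(n)] += 1
    return {c: 100.0 * k / len(developer_counts) for c, k in totals.items()}
```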
Attributes: subscribers • We use publicly available data as a proxy for the number of users • Users ~ Mailing List subscribers (public datum) • It’s not a monotonic measure: subscribers can leave as well as join • We have a measure of users at two different observations
Distribution of subscribers over projects • Around 42% of projects have at most 1 subscriber-user
Attributes: project’s size • We evaluate the code of each project twice • Code evaluated is contained in packages. We exclude from the count: • Auxiliary files: documentation, configuration files, GIF files, etc. • Legacy code: inherited libraries (e.g. Gnome macros), internationalization code
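As a rough illustration of the size measurement, the sketch below sums file sizes in an unpacked package while skipping auxiliary files and legacy directories. The exclusion lists are hypothetical examples chosen for this sketch, not the lists used in the study.

```python
import os

# Hypothetical exclusion lists; the study's actual lists are not given.
EXCLUDED_EXTENSIONS = {".gif", ".png", ".txt", ".html", ".conf", ".po"}
EXCLUDED_DIRS = {"doc", "docs", "po", "intl"}


def code_size_kb(root):
    """Return the size in KB of the files under `root` that count as code."""
    total_bytes = 0
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune excluded directories in place so os.walk does not descend.
        dirnames[:] = [d for d in dirnames if d.lower() not in EXCLUDED_DIRS]
        for name in filenames:
            extension = os.path.splitext(name)[1].lower()
            if extension in EXCLUDED_EXTENSIONS:
                continue
            total_bytes += os.path.getsize(os.path.join(dirpath, name))
    return total_bytes / 1024.0
```

Running the same function on the package from each observation gives the two size values that are compared later.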
Conclusions I • The vast majority of projects are developed by only one developer • Adding people to a project has a small effect on productivity (i.e. code per developer) • Open Source software is made by experts for experts (72% of horizontal projects have more than 10 developers) • 58% of projects didn’t change their size • 63% of projects had a change within 1%
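The two size-change figures can be read as shares of projects whose relative size change between the two observations stays within a tolerance (0 for "didn't change", 0.01 for "within 1%"). The sketch below shows that computation under the assumption that change is measured relative to the first observation; the function names are illustrative.

```python
def relative_change(size_first_kb, size_second_kb):
    """Relative size change from the first to the second observation."""
    if size_first_kb <= 0:
        raise ValueError("first observation must have a positive size")
    return (size_second_kb - size_first_kb) / size_first_kb


def share_within(sizes, tolerance):
    """Percentage of projects whose absolute relative change <= tolerance.

    `sizes` is a list of (first_kb, second_kb) pairs, one per project.
    """
    if not sizes:
        raise ValueError("no projects given")
    hits = sum(
        1 for first, second in sizes
        if abs(relative_change(first, second)) <= tolerance
    )
    return 100.0 * hits / len(sizes)
```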
Conclusions II • Java is relevant for 8% of the projects, C/C++ for 56%, Perl and Python together for 20% • Observations from flagship projects (Apache, Linux, Gnome) are not confirmed for an average Open Source project • Several projects are white noise and should be filtered out • Public repositories hold a huge amount of data: empirical researchers have an invaluable resource of software data
Current and future work • Eliminating white noise: only projects in clusters 3 and 4 have been selected • Deeper analysis: the whole history of a project is being studied • What can we say with respect to conclusions on bigger OS projects? • What can be said about OSS evolution compared with traditional software evolution?