1 / 70

Evolution in Open Source Software: A Case Study

Evolution in Open Source Software: A Case Study. Michael W. Godfrey Qiang Tu [paper in ICSM 2000] Software Architecture Group University of Waterloo. What is software evolution?. “ Evolution is what happens while you’re busy making other plans.”

effie
Download Presentation

Evolution in Open Source Software: A Case Study

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Evolution in Open Source Software: A Case Study Michael W. Godfrey Qiang Tu [paper in ICSM 2000] Software Architecture Group University of Waterloo

  2. What is software evolution? “Evolution is what happens while you’re busy making other plans.” • Usually, we consider evolution to begin once the first version has been delivered: • Maintenance is the planned set of tasks to effect changes. • Evolution is what actually happens to the software.

  3. Previous research • Lehman’s laws • Parnas on software geriatrics • Eick et al. on code decay (10 MLOC telecom) • Gall et al. (10 MLOC telecom) • Munro, Burd et al. (2 MLOC gcc)

  4. Lehman’s Laws in a nutshell • Observations: • (Most) useful software must evolve or die. • As a software system gets bigger, its resulting complexity tends to limit its ability to grow. • Development progress/effort is (more or less) constant; growth is at best constant. • Advice: • Need to manage complexity. • Do periodic redesigns. • Treat software and its development process as a feedback system (and not as a passive theorem).

  5. Lehman’s examples

  6. A case study in evolution:The Linux OS kernel

  7. A case study in evolution:The Linux OS kernel • It’s Linux! • Large system, very stable, many releases over several years, many developers • Growing mainstream adoption • Open source development model • Interesting phenomenon in itself • Easy to track, can publish results, many experts • Not much previous study

  8. Methodology • Examined 96 versions of Linux kernel • 34 of the 67 stable releases • 62 of the 369 development releases • All measures considered only .c/.h files contained in the tarball • Counted LOC using “wc –l” and an awk script that ignored comments and blank lines • Counted # of fcns/vars/macros using ctags • Architectural model (SSs hierarchy) based on default directory structure • We plotted growth against calendar time • Lehman suggests plotting growth against release number

  9. Growth of # of source files

  10. Growth of # of global fcns, variables, and macros

  11. Growth of Lines of Code (LOC)

  12. Average/median .c file size

  13. Average/median .h file size

  14. Growth of major SSs (dev. releases)

  15. SS LOC as percentage of total system

  16. SS LOC as percentage of total system (ignoring drivers)

  17. Growth of small core SSs

  18. Growth of arch SSs

  19. Growth of drivers SSs

  20. Observations and hypotheses • Growth along devel. path is super-linear y = .21*x^2 + 252*x + 90,055 r2=.997 y = size in LOC x = days since v1.0 r2 is “coefficient of determination” using least squares [Lehman/Turski’s model: y’ = y + E/y^2  (3Ex)^(1/3)] • Linux’s strong growth is continuing. • This is stronger growth at MLOC level than observed by others (Lehman, Gall), even for other OSs.

  21. Why has Linux been able to continue its geometric growth? • Core code quality is carefully maintained • Architecture/problem domain • It’s largely drivers • Much of the code is “parallel” • It’s not as big as you might think • Vanilla configuration used only 15% of files • Development model (OSD) and its sociology • Popularity and visibility has encouraged outsiders (both hackers and industry) to contribute

  22. Growth of pine(email client)

  23. Growth of gcc/g++/egcs

  24. Growth of vim(text editor)

  25. vim avg % comments and blank lines per file

  26. vim avg/median file size

  27. vim’s architecture

  28. Hypotheses Factors affecting evolution include • Size and age of system • Use of traditional sw. eng. principles during development PLUS • Problem domain • Problem complexity, multi-platform, multi-features • Software architecture • Process model • Sociology, market forces, and acts-of-God

  29. Software evolution research: What next? So far, we have examined only growth. • More case studies needed • Qualitative and quantitative • Industrial and open source systems • Different problem domains, architectures • Supporting tools to aid analysing, visualizing, and querying program evolution • More than just RCS and perl • Support for architecture repair • Codified knowledge: Why and how does software change? • Build catalogue of change patterns and evolutionary narratives

  30. Codified knowledge • Mature engineering disciplines codify knowledge and experience. • Arguably, this is lacking in software engineering. • Software architecture styles [Shaw] • Design patterns [GoF] • Codified knowledge of how and why programs evolve: • Evolutionary narratives [Godfrey] • Long term, coarse granularity • Change patterns • Short term, fine granularity

  31. Change patterns and evolutionary narratives • Cathedral style[Raymond] • careful control and management • debugging done before committing code • evolution is slow, planned, rarely undone • Bazaar style (OSD) • lots of low-level changes, frequent fixes • lots of “building around” rather than wholesale changing, occasional redesigns • creeping feature-itis, “complete” dependency graph

  32. Change patterns and evolutionary narratives • Band-aid evolution (just add a layer) • quick & dirty way to add new functionality, esp. if system is not well understood e.g., Y2K fixing, adding portability, new features • “Vestigial features” • design artifact persists after rationale dies e.g., whale fin bone structure resembles hand

  33. Change patterns and evolutionary narratives • Phenomena observed in Linux evolution • Bandwagon effect • Contributed third party code • “Mostly parallel” enables sustained growth • Clone and hack • Careful control of core code; more flexibility on contributed drivers, experimental features

  34. Defining, Transforming, and Exchanging High-Level SchemasA guided journey through the outback Presented by Michael W. Godfrey Software Architecture Group (SWAG) Dept of Comp Sci, Univ of Waterloo This presentation is available from http://plg.uwaterloo.ca/~migod/papers/

  35. What is a High-Level Schema? My answer: Any schema above the statement level I see two distinct levels of abstraction: • Programming language entity level • Entities are (shared) fcns, vars, types, classes, … • Architectural level • Entities are modules, subsystems, classes, interfaces, …

  36. Previous Work • Lots of • motivational work • ad hoc extractor snarfing • experimental translation mechanisms • Examples (many others exist) • CORUM I and II • GRAX • TAXForm (TA eXchange FORMat) using Acacia, Rigiparse • Rigi using VisualAge C++ • Dali using Sniff+

  37. My (selfish) goals • I would like to be able to use other extractors … • Want to perform architectural analyses of systems written in languages other than C • Want to implement BEAGLE (a tool for exploring software evolution) • … but extractors differ in languages modelled, level of detail, robustness, bugs, data format, … • I want to be able to convert data between tools. • Need agreement (awareness) from tool creators

  38. TAXForm Utopia

  39. Universal High-Level Procedural Object-Oriented PL/I C C++ Java Acacia C PBS C Rigi C Transforming Between Schemas

  40. TAXForm — Procedural schema

  41. TAXForm — High level schema

  42. Back to my (selfish) goals • Would like to concentrate on procedural and OO languages. • Others are interested in COBOL, JCL etc. • I am interested in high-level info (f calls g) • but not in ASGs, code-level metrics • Need to agree on • Syntax • Level of granularity and detail • What to do in case of X e.g., X = “missing files”

  43. My schema wish list [influenced by Acacia’s C and C++ data models] Top-level programming language entities: • functions, variables, constants, type definitions (procedural languages) • methods, class member data, static methods and member data (object-oriented languages) Entity containers: • files, modules, classes, packages

  44. My schema wish list Entity attributes: • Name, unique identifier (UID -- see next section) • UID of container, UID of containing file (if container is not a file) • Signature/data type • Line number information (see below) • Declared scope/visibility, static or not, final or not • Definition or declaration (see below) Entity container attributes: • name, UID • relative path (if a file) • version identifier (if provided) • UID of container (if not a file), UID of cont. file (if not a file)

  45. My schema wish list Relationships: • Function calls, variable uses • Line number information (see below) • Container use/inclusion (by other containers) • Inheritance (various kinds) • “Friendship”, various template relationships Relationship attributes: • Line number information (see below) • Scope/permission of inheritance

  46. Problems Some technical problems: • UID generation? (name-mangling?) • Line numbering (ranges)? • Incomplete information? • ill-formed code, gcc/K&R-isms • missing header files • resolving entity use to dfn/dcl (esp. with polymorphism, overloading) • Pre or post preprocessing?

  47. Problems We’ve had these conversations before … “Getting academics to agree on anything is like herding cats.”

  48. Included here: PBS [UWloo] Acacia [AT&T] cxref, ctags, cscope TA++ [UOttawa] BAUHAUS [UStuttgart] GUPRO [UKoblenz] Others: Rigi [UVictoria] SPOOL [UMontréal] Datrix [Bell Canada] MOOSE [UBern] SHORE [SD&M] Neuhold [UVienna] VisualAge C++ [IBM] … [many others] Example Extractors/Systems

  49. Dimensions of Variation • Intended use • Level of schema (entity level, architectural level, or mixed) • Amount of detail • Languages modelled • Multi-lingual • Common super schemas • Explicit model “cross-overs” (e.g., JCL, embedded SQL) • Hidden assumptions • Known limitations • Notation/approach to store factbase • Support for translations and transformations • What’s particularly novel and noteworthy

  50. PBS [Holt et al. @ UWaterloo] • Portable Bookshelf is a reverse engineering tool for creating software architecture models of large systems: • Guinea pigs: Mozilla, Linux, Apache, VIM, Mitel, TOBEY, … • Consists of fact extractor, fact manipulation engine (“grok”), and visualization tool (“landscape”) cfx entity-level facts grok landscape viewer source code architectural facts

More Related