

  1. Towards Content-Driven Reputation for Collaborative Code Repositories. Andrew G. West and Insup Lee, August 28, 2012

  2. Big Concept • Apply reputation algorithms developed for wikis in collaborative code repositories: • Do the computed reputations accurately reflect user behavior? If so, how could such a system be useful in practice? • What do inaccuracies teach us about differences in the evolution of code vs. natural language content? Adaptation?

  3. Motivations • Platform equivalence • Purely collaborative • Increasingly distributed; collaboration between unknown/untrusted parties • VehicleForge.mil [1] • Crowdsourcing a next-generation military vehicle • Trust implications!

  4. CONTENT-DRIVEN REPUTATION

  5. Content Driven Rep. [Figure: article version history V0, V1 with authors; V1 by A1 reads "Mr. Franklin flew a kite" (initialization)] • IDEA: Content that survives is good content. Good content is written/maintained by good authors. • V1: No reputation changes; no survival

  6. Content Driven Rep. [Figure: version history V0–V4 with authors A1–A4; V1 by A1 reads "Mr. Franklin flew a kite", V2 by A2 reads "Your mom flew a plane" (damage)] • IDEA: When a subsequent editor allows content to survive, it has his/her implicit approval (and vice versa) • V2: Author A2 deletes most of A1's content. Reputation of A1 is negatively impacted.

  7. Content Driven Rep. [Figure: version history V0–V4; V3 by A3 restores "Mr. Franklin flew a kite" (content restoration)] • IDEA: Survival is examined at depth • V3: Author A3 reverts A2's content. Editor A1 gains reputation as his content is restored; A2 loses rep.

  8. Content Driven Rep. [Figure: version history V0–V4; V4 by A4 reads "Mr. Franklin flew a kite and …" (content persistence)] • IDEA: … and the process continues (depth = 10) • V4: Authors A1 and A3 accrue reputation, while A2 continues to receive reputation decrements.

  9. In Practice Implemented as WikiTrust [2, 3] • Token survival + edit distance captures novel content as well as maintenance actions • Size of ∆ is: (1) proportional to degree of change, (2) weighted by the rep. of the editor • Nice security properties • Implicit feedback • Symmetric evaluation • No self approval
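The preceding slides describe how survival drives the reputation update; below is a minimal Python sketch of that idea. The token-overlap measure, the learning rate, and the depth-10 window are illustrative assumptions, not the WikiTrust formulas.

```python
# Illustrative content-survival reputation sketch (not the WikiTrust algorithm).
# Each revision is (author, set-of-tokens); when a new revision arrives, every
# author within the last DEPTH revisions is judged by how much of their text
# still survives, weighted by the judging editor's own reputation.

DEPTH = 10          # how far back survival is examined (assumed value)
LEARN_RATE = 0.1    # scales the size of each reputation delta (assumed value)

def survival(old_tokens, new_tokens):
    """Fraction of old_tokens still present in new_tokens, in [0, 1]."""
    if not old_tokens:
        return 0.0
    return len(old_tokens & new_tokens) / len(old_tokens)

def apply_revision(history, reputation, judge, new_tokens):
    """Update reputations given a new revision by `judge`."""
    judge_weight = 1.0 + reputation.get(judge, 0.0)
    for past_author, past_tokens in history[-DEPTH:]:
        if past_author == judge:
            continue                                  # no self-approval
        surv = survival(past_tokens, new_tokens)
        delta = LEARN_RATE * judge_weight * (surv - 0.5)  # survival rewards, removal punishes
        reputation[past_author] = reputation.get(past_author, 0.0) + delta
    history.append((judge, new_tokens))

# Replay the Franklin example from the slides:
reputation, history = {}, []
apply_revision(history, reputation, "A1", {"Mr.", "Franklin", "flew", "a", "kite"})
apply_revision(history, reputation, "A2", {"Your", "mom", "flew", "a", "plane"})    # damage: A1 penalized
apply_revision(history, reputation, "A3", {"Mr.", "Franklin", "flew", "a", "kite"}) # restoration: A1 up, A2 down
print(reputation)
```

Replaying the Franklin example, A1 loses reputation when A2 damages the text at V2 and regains it when A3 restores the text at V3, while A2 is penalized.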

  10. WikiTrust Success • Live processing of several language editions of Wikipedia; portable! • Implementation [4] works on any MediaWiki installation [Figure: VANDALISM]

  11. REPRESENTING A REPOSITORY ON A WIKI PLATFORM

  12. Repo. ↔ Wiki Model [Figure: SVN layout with trunk/ (revisions 1, 3, 6, 9), branches/ (2, 5, merge), tags/ (4, 7)] • Just replay history in a sequential fashion: • Repository ↔ wiki • Check-in ↔ edit • File ↔ article

  13. Repo. ↔ Wiki Model [Figure: same SVN layout as slide 12] • Just replay history in a sequential fashion: • Repository ↔ wiki • Check-in ↔ edit • File ↔ article • Minor accommodations: • Ignore tags • Ignore branches (merge as a recommendation) • Multi-file check-in

  14. Replay in Practice • [svnsync] produces local copy (not a checkout) • [svn log] yields metadata script (see table) • Pipe file versions into wiki via API • Log-in user (create account if needed) • Use [svn cat path@id] syntax to yield content • Make edit to article “path”. Logout.
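A sketch of this replay loop in Python, assuming the mwclient library for MediaWiki API access and a local mirror produced by svnsync; the repository path, wiki host, credentials, and revision range are placeholders.

```python
# Replay SVN history into a wiki, one check-in per edit (illustrative sketch).
# Assumes: a local mirror created with `svnsync`, and the mwclient library
# for talking to the MediaWiki API.
import subprocess
import mwclient

REPO = "file:///path/to/local-mirror"                 # placeholder svnsync mirror
site = mwclient.Site("wiki.example.org", path="/w/")  # placeholder wiki
site.login("ReplayBot", "password")                   # placeholder credentials

def changed_paths(rev):
    """Ask `svn log` which files a revision touched (verbose, single rev)."""
    out = subprocess.run(
        ["svn", "log", "-v", "-r", str(rev), REPO],
        capture_output=True, text=True, check=True).stdout
    return [line.split()[1] for line in out.splitlines()
            if line.startswith(("   M ", "   A "))]

def file_at(path, rev):
    """`svn cat path@rev` yields the file content at that revision."""
    return subprocess.run(
        ["svn", "cat", f"{REPO}{path}@{rev}"],
        capture_output=True, text=True, check=True).stdout

for rev in range(1, 100):                    # placeholder revision range
    for path in changed_paths(rev):
        page = site.pages[path.lstrip("/")]  # file <-> article
        page.save(file_at(path, rev), summary=f"replay r{rev}")  # check-in <-> edit
```

The per-author log-in and account creation described on the slide are omitted here; a single bot account keeps the sketch short.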

  15. CASE STUDY INTRODUCTION

  16. Mediawiki SVN • Case study repository: Mediawiki SVN [5] • http://hincapie.cis.upenn.edu/wiki_mediawiki/ • Further filtering: • Only PHP files • Core language • No binary files • Tokenization • Toss out i18n files per late 2011

  17. Mediawiki SVN • Case study repository: Mediawiki SVN [5] • http://hincapie.cis.upenn.edu/wiki_mediawiki/ • Further filtering: • Only PHP files • Core language • No binary files • Tokenization • Toss out i18n files • The wiki database is given to the WikiTrust implementation: Revision #A by J had ∆+0.75 on reputation of X=12.05; Revision #B by K had ∆-42.00 on reputation of Y=0.5; Revision #B by K had ∆+16.75 on reputation of Z=1000.1; … • Recall: An edit can change up to 10 reputations!
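A sketch of the kind of filtering described above, in Python; the i18n path pattern and the binary sniff are assumptions about what such a filter might look like, not the authors' exact rules.

```python
# Illustrative file filter for the case study: keep core PHP source,
# drop binaries and i18n/message files (patterns below are assumptions).
import os

def is_probably_binary(data: bytes) -> bool:
    """Crude binary sniff: NUL bytes rarely appear in source text."""
    return b"\x00" in data[:8192]

def keep_file(path: str, data: bytes) -> bool:
    if not path.endswith(".php"):
        return False                      # only PHP files
    name = os.path.basename(path).lower()
    if "i18n" in path.lower() or name.startswith("messages"):
        return False                      # toss out i18n/message files
    if is_probably_binary(data):
        return False                      # no binary files
    return True
```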

  18. General Results (1) [Figure: distribution of final user reputations] • Reputations lie on [0, 20k] • 0.0 is the initial rep. • ≈15 users w/ max. rep.; not always those w/ most revs.

  19. General Results (2) [Figure: distribution of update ∆s, by magnitude] • Majority of updates are positive; evidence of a healthy community • Most freq. update is a 1-10 pt. increment

  20. Example Reputations

  21. EVALUATING REPUTATION ACCURACY

  22. Evaluation Process • Find edits (Ex) where a subsequent edit (Ex+1) resulted in non-trivial rep. loss for the author • Manually inspect comment, Bugzilla, diffs, and ask: "Would editor Ax+1 consider the previous change CONSTRUCTIVE, or UNCONSTRUCTIVE?" • Could be a subjective mess, but… [Figure: edit Ex followed by non-trivial content removal at Ex+1; was this removal the result of ineptitude by the prior editor?]

  23. Classifying Rep. Loss (1) Surprising number of obviously "bad" actions resulting in reverts. Editor calls out previous edit and/or editor explicitly: "Password in plaintext! … DOESN'T WORK … don't put it in trunk!" "massive breakage with incomplete experimental changes" "revert … spewing giant red HTML all over everything" "failed, possibly other problems. NEEDS PARSER TESTS" "ten billion extra callouts …. clutter things up and trigger errors" "… no apparent purpose … more complex and prone to breakage"

  24. Classifying Rep. Loss (2) Some cases are more ambiguous. The editor erred, but it's not immediately clear there should be a significant penalty (NONFATAL): • Code showing no immediate errors: • But reverted (or branched) for testing • Issues unrelated to functional code: • Whitespace, comment/string changes

  25. Evaluation Results • Per a conservative approach, anything not in the other two sets is CONSTRUCTIVE • 63% accuracy if we discount the "non-fatal" cases • 70% accuracy if we interpret them as "unconstructive" • Interpret how you wish; purposely a naïve application • Concentrate on false positives: can the algorithm be improved?

  26. IDENTIFYING & FIXING FALSE POSITIVES + EVALUATION

  27. False Positives (1) • SVN does not handle RENAME elegantly: [Figure: DEL file.c, ADD file_renamed.c] • Consequences: Authors of [file.c] punished; provenance lost; renamer gets all credit • Solutions: Detect via hash; simple wiki "move"
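A sketch of the hash-based rename detection suggested above, in Python; modeling a revision as two dicts of deleted and added paths with their contents is an assumption for illustration.

```python
# Detect a RENAME inside one check-in: a deleted file whose content hash
# matches an added file is treated as a move, not a delete+create.
import hashlib

def sha1(data: bytes) -> str:
    return hashlib.sha1(data).hexdigest()

def detect_renames(deleted: dict, added: dict) -> list:
    """deleted/added map path -> file bytes for a single revision."""
    added_by_hash = {sha1(data): path for path, data in added.items()}
    renames = []
    for old_path, data in deleted.items():
        new_path = added_by_hash.get(sha1(data))
        if new_path is not None:
            renames.append((old_path, new_path))   # issue a wiki "move" instead
    return renames

# Example: file.c deleted and file_renamed.c added with identical content.
print(detect_renames({"file.c": b"int main(){}"},
                     {"file_renamed.c": b"int main(){}"}))
# -> [('file.c', 'file_renamed.c')]
```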

  28. False Positives (2.1) • INTER-DOCUMENT REORGANIZATION is problematic for WikiTrust [Figure: func_b() and func_c() blocks moving between file_1.c and file_2.c, shown as --- ∆ / +++ ∆ diffs] • Intra-doc reorg. is a non-issue! • Solution: Treat the entire code-base as one giant doc (file1.c >> file2.c >> file3.c >> …) and take a global diff; examine all diff ∆; sub-string matching; replay history.

  29. False Positives (2.2) • INTER-DOCUMENT REORGANIZATION is problematic for WikiTrust [Figure: a content block moved into the destination doc's history; the same block appeared 3 edits ago (V1 by A1, V2 by A2, V3 by A3)] • Solution: Intra-document reorg. is a non-issue!; global diff over the entire code-base as one giant doc (file1.c >> file2.c >> file3.c >> …); substring matching; replay history.

  30. False Positives (2.3) • INTER-DOCUMENT REORGANIZATION is problematic for WikiTrust • TRANSCLUSION! [Figure: old doc. vs. new doc.; a {{sect}} reference pulls the section text into the new doc] • Solution: Intra-document reorg. is a non-issue!; global diff over the entire code-base as one giant doc; substring matching; replay history.
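A sketch of the "one giant doc, global diff" idea in Python, using difflib; the concatenation order and line-level granularity are assumptions.

```python
# Treat the whole code base as one concatenated document so that a block
# moved between files shows up as matching text rather than delete+insert.
import difflib

def global_doc(files: dict) -> list:
    """Concatenate all files (sorted by path) into one list of lines."""
    lines = []
    for path in sorted(files):
        lines.extend(files[path].splitlines())
    return lines

def surviving_ratio(old_files: dict, new_files: dict) -> float:
    """Fraction of old lines that still appear somewhere in the new code base."""
    sm = difflib.SequenceMatcher(None, global_doc(old_files), global_doc(new_files))
    matched = sum(size for _, _, size in sm.get_matching_blocks())
    return matched / max(1, len(global_doc(old_files)))

# A function moved from file_1.c to file_2.c still "survives" globally:
old = {"file_1.c": "int a(){}\nint b(){}\n", "file_2.c": "int c(){}\n"}
new = {"file_1.c": "int a(){}\n", "file_2.c": "int b(){}\nint c(){}\n"}
print(surviving_ratio(old, new))   # close to 1.0 despite the inter-file move
```

Because the two global documents match after the move, the moved function's author keeps credit rather than being charged with a deletion.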

  31. False Positives (3) • REVERT CHAINS cause big penalties [Figure: V0 → V1 (+++ BIG CODE CHANGES) → V2 ("Revert: Needs testing first"; (nearly) identical to V0) → V3 (testing done; the big code changes re-applied, (nearly) identical to V1)] • Consequences: At V2, A1 loses reputation (a NONFATAL); at V3, A2 is wrongly punished • Solution: Revert chains rare; manual inspection?
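One way such chains could be flagged automatically rather than by manual inspection is to look for a later revision that is nearly identical to a pre-revert revision; a minimal Python sketch, with the similarity threshold as an assumption.

```python
# Flag revert chains: a revision that closely matches an earlier, reverted
# revision suggests the intermediate "revert" should not count as damage.
import difflib

SIMILARITY = 0.95   # assumed threshold for "nearly identical"

def find_revert_chains(revisions: list) -> list:
    """revisions is an ordered list of file texts; return (earlier, later) index pairs."""
    chains = []
    for later in range(2, len(revisions)):
        for earlier in range(later - 1):
            ratio = difflib.SequenceMatcher(
                None, revisions[earlier], revisions[later]).ratio()
            if ratio >= SIMILARITY:
                chains.append((earlier, later))
    return chains

# V1's big change is reverted at V2 and re-applied at V3:
v0, v1 = "base\n", "base\nbig code changes\n"
print(find_revert_chains([v0, v1, v0, v1]))   # -> [(0, 2), (1, 3)]
```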

  32. False Positives (4) • Initially 30 false positive cases • If "solutions" were implemented, this number would be just 10 • Suggesting accuracies of 80-90% • And those 10 cases? • Benign code evolution • Feature requests; method deprecation; no fault • Results similar for [ruby] and [httpd]

  33. Better Evaluation • POC evaluation lacking in many ways • Not enough examples. Subjective. • Says nothing about true negatives • Bug attribution is extremely difficult • Corpus: “X erred at rev. Y with severity {L,M,H}” • If it could be automated; problem solved! • Work backwards from Bugzilla? Developers? • Reputation as a predictor of future loss events. • Qualitative instead of quantitative measures

  34. Other Optimization • Lots of free variables, weights, ceilings [Figure: the same loop shown as raw code (// this is a loop for(int i=0;i<10;i++) print("Some text");), as canonical code (for ( int i = 0; i < 10; i++ ){ print( "" ); }), and after tokenization (for ( int i = 0 ; i < 10 ; i++ ){ print( "" ); })]
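A sketch of what such canonicalization and tokenization might look like in Python; the specific rules (dropping comments, blanking string literals, splitting punctuation into tokens) are assumptions for illustration.

```python
# Illustrative canonicalization + tokenization of a code line before the
# survival analysis: comments dropped, string literals blanked, punctuation
# split into its own tokens so whitespace differences stop mattering.
import re

def canonicalize(line: str) -> str:
    line = re.sub(r"//.*$", "", line)          # drop line comments
    line = re.sub(r'"[^"]*"', '""', line)      # blank out string literals
    return line.strip()

def tokenize(line: str) -> list:
    # Split identifiers/numbers and keep each punctuation character as a token.
    return re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", canonicalize(line))

src = '// this is a loop\nfor(int i=0;i<10;i++) print("Some text");'
for raw in src.splitlines():
    print(tokenize(raw))
# []   (the comment line canonicalizes away)
# ['for', '(', 'int', 'i', '=', '0', ';', 'i', '<', '10', ';', 'i', '+', '+',
#  ')', 'print', '(', '"', '"', ')', ';']
```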

  35. USE-CASES & CONCLUSIONS

  36. Use-case: Small Projects • Small/non-production proj. • Conflict, not just tokens! • Undergraduate research • Who did all the work? • Academic paper repositories • Automatic author order! • Collaboration or conflict? • Graph of reputation events [Figure: reputation-event graph; Faction #1 (A, B) and Faction #2 (C, D) exchange positive edges internally and negative edges across factions]

  37. Use-cases (2) MEDIAWIKI • Alert service/warnings (anti-vandal style) • Expediting code review • Permission granting/revocation

  38. Use-cases (2) MEDIAWIKI • Alert service/warnings (anti-vandal style) • Expediting code review • Permission granting/revocation VEHICLEFORGE.MIL • Access control for users/commits • Wrap content-persistent reputation with metadata features for a stronger classifier [6] • Robustness considerations (i.e., reachability)

  39. Conclusions • Despite high-(er) barriers to entry, bad things still happen in production repositories! • Content-persistence is a reasonably accurate way to identify these instances ex post facto • False positives indicate code uniqueness: • 1. Non-functional aspects are non-trivial (WS, comments) • 2. Inter-document reorganization is common • 3. Quality-assurance is more than surface level • Evaluation needs to be more rigorous • A variety of use-cases if it becomes production-ready

  40. References
  [1] Lohr, Steve. "Pentagon Pushes Crowdsourced Manufacturing". New York Times "Bits Blog". April 5, 2012.
  [2] Adler, B.T. and L. de Alfaro. "A Content-Driven Reputation System for Wikipedia". In WWW 2007: Proc. of the 16th Intl. World Wide Web Conference.
  [3] Adler, B.T., et al. "Measuring Author Contributions to Wikipedia". In WikiSym 2008: Proc. of the 3rd Intl. Symposium on Wikis and Open Collaboration.
  [4] WikiTrust online. http://www.wikitrust.net/
  [5] Mediawiki SVN. http://svn.wikimedia.org/viewvc/mediawiki/ (note: this is an archive of that resource; Git is the currently used repository software)
  [6] Adler, B.T., et al. "Wikipedia Vandalism Detection: Combining Natural Language, Metadata, and Reputation Features". In CICLing 2011: Proc. of the 12th Intl. Conference on Intelligent Text Processing and Computational Linguistics.
  [Ø] Mediawiki Developer Hub. http://www.mediawiki.org/wiki/Developer_hub
