
Improving Programmer Productivity via Mining Program Source Code

Presentation Transcript


  1. Improving Programmer Productivity via Mining Program Source Code

  2. Mining SE Data
     • MAIN GOAL
       • Transform static record-keeping SE data to active data
       • Make SE data actionable by uncovering hidden patterns and trends
     • Data sources: mailings, Bugzilla, code repository, CVS, execution traces
     T. Xie, Mining Program Source Code

  3. Overview of Mining SE Data
     • Software engineering tasks helped by data mining: programming, defect detection, testing, debugging, maintenance, …
     • Data mining techniques: classification, association/patterns, clustering, …
     • Software engineering data: code bases, change history, program states, structural entities, bug reports/NL, …

  4. Overview of Mining SE Data
     [Figure: related publications from 1999 to 2007 (venues: ASE, ICSE, FSE, PLDI, POPL, OSDI, SOSP, OOPSLA, ISSTA, KDD), grouped by the kind of software engineering data mined: code bases, change history, program states, structural entities, bug reports/NL, …]

  6. Overview of Mining SE Data
     [Figure: related publications from 1999 to 2007 (venues: KDD, ICSE, ASE, FSE, PLDI, POPL, OOPSLA, ISSTA, SOSP, OSDI), grouped by the software engineering task helped by data mining: programming, defect detection, testing, debugging, maintenance, …]

  8. Sample Projects on Mining Program Source Code

  9. Some Recent Trends
     • Data: dynamic execution data -> + static code bases
     • Task: productivity (programming) -> + quality (defect detection, testing, debugging)
     • Mining algorithm: simple ones (association rule) -> + frequent itemset/subsequence/partial order/subgraph
     • Data scope: local repositories -> public repositories with code search engines

  11. Mining API Usage Patterns
     • How should an API be used correctly?
     • An API may serve multiple functionalities
     • Different styles of API usage
     • MAPO: “I know what method call I need, but I don’t know how to write code before and after this method call” [Xie&Pei MSR 06]

  12. Example Task -- MAPO
     • “instrument the bytecode of a Java class by adding an extra method to the class”
     • org.apache.bcel.generic.ClassGen
         public void addMethod(Method m)

  13. First Try: ClassGen Java API Doc
     addMethod
         public void addMethod(Method m)
     Add a method to this class.
     Parameters: m - method to add

  14. Second Try: Code Search Engine

  15. MAPO Approach
     • Analyze code segments relevant to a given API and disclose the inherent usage patterns
     • Input: an API characterized by a method, class, or package
     • Code search engine: used to search relevant source files from open source repositories
     • Frequent sequence miner: use BIDE [Wang&Han 04] to mine closed sequential patterns from extracted method-call sequences
     • Output: a short list of frequent API usage patterns related to the API

  16. Sequence Extraction
     • Method sequences: extracted from Java source files returned from code search engines
     Source code:
         public void generateStubMethod(ClassGen c) {
             InstructionList il = new InstructionList();
             MethodGen m = genFromISList(il);
             m.setMaxLocals();
             m.setMaxStack();
             c.addMethod(m.getMethod());
             System.out.println("…");
             …
         }
     Call sequence:
         InstructionList.<init>()
         genFromISList(InstructionList)
         MethodGen.setMaxStack()
         MethodGen.setMaxLocals()
         MethodGen.getMethod()
         ClassGen.addMethod(Method)
         PrintStream.println(String)
         …

  17. Sequence Preprocessing
     • Remove common Java library calls
     • Inline callees of the same class
     • Remove sequences that contain no query words: ClassGen and addMethod
     Example: the generateStubMethod source code and call sequence shown on the Sequence Extraction slide.
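The two filtering steps above can be sketched roughly as follows. This is a minimal illustration, not MAPO's implementation: the class name, the method names, and the short list of "common library" receiver types are all assumptions made for the example, and the inlining step is not modeled.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; all names here are hypothetical, not taken from MAPO.
final class SequencePreprocessor {

    // Assumption: these receiver types stand in for "common Java library calls".
    private static final List<String> COMMON_LIBRARY_TYPES =
            List.of("PrintStream", "String", "StringBuffer", "Object");

    /** Returns the receiver type of a call such as "MethodGen.getMethod()", or "" if none. */
    static String receiverOf(String call) {
        int paren = call.indexOf('(');
        String head = paren < 0 ? call : call.substring(0, paren);
        int dot = head.lastIndexOf('.');
        return dot < 0 ? "" : head.substring(0, dot);
    }

    /** Drops calls whose receiver is one of the common library types. */
    static List<String> removeLibraryCalls(List<String> sequence) {
        List<String> kept = new ArrayList<>();
        for (String call : sequence) {
            if (!COMMON_LIBRARY_TYPES.contains(receiverOf(call))) {
                kept.add(call);
            }
        }
        return kept;
    }

    /** True if at least one call mentions at least one query word. */
    static boolean mentionsQuery(List<String> sequence, List<String> queryWords) {
        for (String call : sequence) {
            for (String word : queryWords) {
                if (call.contains(word)) {
                    return true;
                }
            }
        }
        return false;
    }

    /** Applies both filters to every extracted sequence. */
    static List<List<String>> preprocess(List<List<String>> sequences, List<String> queryWords) {
        List<List<String>> result = new ArrayList<>();
        for (List<String> sequence : sequences) {
            List<String> cleaned = removeLibraryCalls(sequence);
            if (mentionsQuery(cleaned, queryWords)) {
                result.add(cleaned);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> extracted = List.of(
                "InstructionList.<init>()", "genFromISList(InstructionList)",
                "MethodGen.setMaxStack()", "MethodGen.setMaxLocals()",
                "MethodGen.getMethod()", "ClassGen.addMethod(Method)",
                "PrintStream.println(String)");
        System.out.println(preprocess(List.of(extracted), List.of("ClassGen", "addMethod")));
    }
}
```

Here "contain no query words" is read as "none of the query words appears in any call"; a stricter reading (all query words must appear) would only change `mentionsQuery`.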

  18. Frequent Seq Postprocessing
     • Remove sequences that contain no query words: ClassGen and addMethod
     • Compress consecutive calls of the same method into one, e.g., abbba -> aba
     • Remove duplicate frequent sequences after the compression, e.g., aba, aba -> aba
     • Reduce a seq if it is a subseq of another, e.g., aba, abab -> abab
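The compression, deduplication, and subsequence-reduction steps above can be sketched as below. This is an illustrative sketch with hypothetical class and method names, not the tool's actual code; the query-word filter is left out because it is the same check as in preprocessing.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch only; names are hypothetical.
final class PatternPostprocessor {

    /** Collapses runs of the same call, e.g. a b b b a becomes a b a. */
    static List<String> compress(List<String> sequence) {
        List<String> out = new ArrayList<>();
        for (String call : sequence) {
            if (out.isEmpty() || !out.get(out.size() - 1).equals(call)) {
                out.add(call);
            }
        }
        return out;
    }

    /** True if every element of shorter appears in longer, in the same order. */
    static boolean isSubsequence(List<String> shorter, List<String> longer) {
        int matched = 0;
        for (String call : longer) {
            if (matched < shorter.size() && shorter.get(matched).equals(call)) {
                matched++;
            }
        }
        return matched == shorter.size();
    }

    /** Compresses, deduplicates, then drops patterns subsumed by another pattern. */
    static List<List<String>> postprocess(List<List<String>> patterns) {
        Set<List<String>> unique = new LinkedHashSet<>();
        for (List<String> pattern : patterns) {
            unique.add(compress(pattern)); // equal lists collapse into one entry
        }
        List<List<String>> candidates = new ArrayList<>(unique);
        List<List<String>> result = new ArrayList<>();
        for (List<String> p : candidates) {
            boolean subsumed = false;
            for (List<String> q : candidates) {
                if (p != q && isSubsequence(p, q)) {
                    subsumed = true;
                    break;
                }
            }
            if (!subsumed) {
                result.add(p);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<String>> mined = List.of(
                List.of("a", "b", "b", "b", "a"),
                List.of("a", "b", "a"),
                List.of("a", "b", "a", "b"));
        System.out.println(postprocess(mined)); // prints [[a, b, a, b]]
    }
}
```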

  19. Tool Architecture
     [Figure: tool architecture; code search engine, e.g., koders.com]

  20. Sample Mined API Sequence
         InstructionList.<init>()
         InstructionFactory.createLoad(Type, int)
         InstructionList.append(Instruction)
         InstructionFactory.createReturn(Type)
         InstructionList.append(Instruction)
         MethodGen.setMaxStack()
         MethodGen.setMaxLocals()
         MethodGen.getMethod()
         ClassGen.addMethod(Method)
         InstructionList.dispose()

  22. Mining API Usage Patterns
     • MAPO: “I know what method call I need, but I don’t know how to write code before and after this method call” [Xie&Pei MSR 06]
     • Apiartor: “I know what possible set of APIs I need, but I don’t know which ones need to be used and in what order” [Acharya et al. FSE 07]

  23. Usage Patterns as Partial Order
     (a) Example code:
         #include <abcdef.h>
         void p ( ) { b ( ); c ( ); }
         void q ( ) { c ( ); b ( ); }
         void r ( ) { e ( ); f ( ); }
         void s ( ) { f ( ); e ( ); }
         int main ( ) {
           int i, j, k;
           a ( );
           if ( i == 1) { f ( ); e ( ); c ( ); exit ( ); }
           else {
             if ( j == 1 ) p ( ); else q ( );
             d ( );
             if ( k == 1 ) r ( ); else s ( );
           }
         }
     (b) Static program traces:
         1  a -> f -> e -> c
         2  a -> b -> c -> d -> e -> f
         3  a -> c -> b -> d -> e -> f
         4  a -> b -> c -> d -> f -> e
         5  a -> c -> b -> d -> f -> e
     (c) Frequent subseq patterns:
         a -> b -> d -> e
         a -> b -> d -> f
         a -> c -> d -> e
         a -> c -> d -> f
     (d) Frequent partial order R (figure with nodes a, b, c, d, e, f)
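One simple way to see how the frequent subsequence patterns in (c) combine into a partial order is sketched below. This is not the algorithm of [Acharya et al. FSE 07]; it is a hypothetical illustration that collects "x occurs before y" pairs from the patterns, leaves pairs with conflicting orders unordered, and keeps only the edges not implied by a two-step chain. It assumes call names do not contain the text "->".

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Illustrative sketch only; not the published mining algorithm.
final class PartialOrderSketch {

    /** Collects every "x->y" pair where x occurs before y in some pattern. */
    static Set<String> precedence(List<List<String>> patterns) {
        Set<String> pairs = new TreeSet<>();
        for (List<String> pattern : patterns) {
            for (int i = 0; i < pattern.size(); i++) {
                for (int j = i + 1; j < pattern.size(); j++) {
                    pairs.add(pattern.get(i) + "->" + pattern.get(j));
                }
            }
        }
        return pairs;
    }

    /** Keeps pairs that are neither contradicted nor implied via an intermediate call. */
    static Set<String> coveringEdges(List<List<String>> patterns) {
        Set<String> pairs = precedence(patterns);
        Set<String> calls = new TreeSet<>();
        for (List<String> pattern : patterns) {
            calls.addAll(pattern);
        }
        Set<String> edges = new TreeSet<>();
        for (String pair : pairs) {
            String[] xy = pair.split("->");
            String x = xy[0];
            String y = xy[1];
            if (pairs.contains(y + "->" + x)) {
                continue; // seen in both orders: treat x and y as unordered
            }
            boolean implied = false;
            for (String mid : calls) {
                if (pairs.contains(x + "->" + mid) && pairs.contains(mid + "->" + y)) {
                    implied = true;
                    break;
                }
            }
            if (!implied) {
                edges.add(pair);
            }
        }
        return edges;
    }

    public static void main(String[] args) {
        List<List<String>> patterns = List.of(
                List.of("a", "b", "d", "e"), List.of("a", "b", "d", "f"),
                List.of("a", "c", "d", "e"), List.of("a", "c", "d", "f"));
        // prints [a->b, a->c, b->d, c->d, d->e, d->f]
        System.out.println(coveringEdges(patterns));
    }
}
```

On the four patterns from (c), b and c stay unordered, as do e and f, because no pattern contains both members of either pair.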

  24. Apiartor Overview
     [Figure: tool pipeline. Components: Trigger Generator, Model Checker, Trace Generator, Scenario Extractor, Miner, Specification Extractor. Data: Source Code, User-specified APIs, Related APIs, Triggers, Traces, Independent Scenarios, Frequent Usage Scenarios, Partial Orders, Specifications]

  25. Example Partial Orders
     [Figure: a usage scenario around the XOpenDisplay API as a partial order. Specifications are shown with dotted lines. APIs in the figure: XOpenDisplay, XCreateWindow, XCreateGC, XGetWindowAttributes, XSelectInput, XMapWindow, XSetForeground, XGetBackground, XChangeWindowAttributes, XNextEvent, XMapWindow, XGetAtomName, XFreeGC, XCloseDisplay]

  27. Mining API Usage Patterns
     • MAPO: “I know what method call I need, but I don’t know how to write code before and after this method call” [Xie&Pei MSR 06]
     • Apiartor: “I know what possible set of APIs I need, but I don’t know which ones need to be used and in what order” [Acharya et al. FSE 07]
     • PARSEWeb: “I know what type of object I need, but I don’t know how to write the code to get the object” [Thummalapenta&Xie ASE 07]

  28. Example Task - OpenJMS
     • Sun Java Message Service API Spec
     • Query: “javax.jms.QueueConnectionFactory -> javax.jms.QueueSender”
     • PARSEWeb Solution:
         FileName:0_UserBean.java MethodName:ingest Rank:1 NumberOfOccurrences:23
         Confidence:True
         Path:
           1. javax.jms.QueueConnectionFactory,createQueueConnection() ReturnType:javax.jms.QueueConnection
           2. javax.jms.QueueConnection,createQueueSession(boolean,javax.jms.Session.AUTO_ACKNOWLEDGE) ReturnType:javax.jms.QueueSession
           3. javax.jms.QueueSession,createSender(javax.jms.Queue) ReturnType:javax.jms.QueueSender
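To make the shape of a "Source -> Destination" query concrete, the sketch below chains methods by return type until the destination type is reached. It deliberately swaps in a plain breadth-first search over a small hand-written signature table; PARSEWeb itself mines sequences from code samples, so this only illustrates the kind of answer expected. The class `TypePathSketch`, its `Method` holder, and `jmsMethods()` are hypothetical names; the listed JMS methods are the ones from the solution above plus one extra method on the session as a distractor.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: breadth-first search over a hand-written signature table.
final class TypePathSketch {

    /** One method signature: declaring type, call text, and return type. */
    static final class Method {
        final String owner;
        final String call;
        final String returnType;

        Method(String owner, String call, String returnType) {
            this.owner = owner;
            this.call = call;
            this.returnType = returnType;
        }
    }

    /** Hypothetical signature table for the OpenJMS example. */
    static List<Method> jmsMethods() {
        return List.of(
                new Method("javax.jms.QueueConnectionFactory", "createQueueConnection()",
                        "javax.jms.QueueConnection"),
                new Method("javax.jms.QueueConnection", "createQueueSession(boolean,int)",
                        "javax.jms.QueueSession"),
                new Method("javax.jms.QueueSession", "createReceiver(javax.jms.Queue)",
                        "javax.jms.QueueReceiver"),
                new Method("javax.jms.QueueSession", "createSender(javax.jms.Queue)",
                        "javax.jms.QueueSender"));
    }

    /** Returns the shortest call chain from source to destination, or an empty list. */
    static List<String> findPath(List<Method> methods, String source, String destination) {
        Map<String, Method> reachedBy = new HashMap<>(); // type -> method that produced it
        Deque<String> queue = new ArrayDeque<>();
        reachedBy.put(source, null);
        queue.add(source);
        while (!queue.isEmpty()) {
            String type = queue.remove();
            if (type.equals(destination)) {
                break;
            }
            for (Method m : methods) {
                if (m.owner.equals(type) && !reachedBy.containsKey(m.returnType)) {
                    reachedBy.put(m.returnType, m);
                    queue.add(m.returnType);
                }
            }
        }
        List<String> path = new ArrayList<>();
        // Walk back from the destination to the source, prepending each step.
        for (Method m = reachedBy.get(destination); m != null; m = reachedBy.get(m.owner)) {
            path.add(0, m.owner + "," + m.call + " ReturnType:" + m.returnType);
        }
        return path;
    }

    public static void main(String[] args) {
        for (String step : findPath(jmsMethods(),
                "javax.jms.QueueConnectionFactory", "javax.jms.QueueSender")) {
            System.out.println(step);
        }
    }
}
```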

  29. PARSEWeb Overview
     [Figure: PARSEWeb pipeline. Components: Code Search Engine, Code Downloader, Code Analyzer, Sequence Miner, Query Splitter. Data: Query, Open Source Repositories, Local Source Code Repository, Method Invocation Sequences, Clustered Method Invocation Sequences, Final Method Invocation Sequences]

  31. Code Analyzer
     • Collect [Source -> Destination] method sequences invoked by each public method
     • Deal with local method calls by inlining methods
     • Deal with conditionals/loops by traversing control flow graphs
     • Resolve types in sequences
     • Challenges: downloaded files are partial
     • Solutions: heuristics are developed

  32. Type Heuristics
     • Heuristic 1: The return type of a method invocation contained in an initialization expression is the same as the type of the declared variable, e.g.,
         QueueConnection connect;
         QueueSession session = connect.createQueueSession(false,int)
     • Heuristic 2: The return type of the outermost method invocation contained in a return statement is the same as the return type of the enclosing method declaration, e.g.,
         public int test() { ... return connect.createQueueSession(false,int); }

  34. Sequence Miner
     • The code analyzer may produce too many candidate sequences
     Solutions:
     • Cluster similar sequences
       • Clustering heuristics are developed
     • Rank sequences
       • Ranking heuristics are developed

  35. Clustering Heuristics
     • Heuristic 1: Method-invocation sequences with the same set of statements can be considered similar, even if the statements are in a different order, e.g., "2 3 4 5" and "2 4 3 5"
     • Heuristic 2: Method-invocation sequences that differ by at most the given cluster precision value can be considered similar, e.g., "8 9 6 7" and "8 6 10 7" can be considered similar under cluster precision value one
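Both heuristics can be read as a single set-based check, sketched below. The interpretation is an assumption: two sequences count as similar when each contains at most `precision` statements that the other lacks, so Heuristic 1 is the case of precision zero. The class name `ClusterSimilarity` is hypothetical, and statements are represented by the integer ids used in the examples.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch only; one possible reading of the two clustering heuristics.
final class ClusterSimilarity {

    /** True if each sequence has at most `precision` statements missing from the other. */
    static boolean similar(List<Integer> first, List<Integer> second, int precision) {
        Set<Integer> onlyInFirst = new HashSet<>(first);
        onlyInFirst.removeAll(second);
        Set<Integer> onlyInSecond = new HashSet<>(second);
        onlyInSecond.removeAll(first);
        return onlyInFirst.size() <= precision && onlyInSecond.size() <= precision;
    }

    public static void main(String[] args) {
        // Heuristic 1: same statements, different order.
        System.out.println(similar(List.of(2, 3, 4, 5), List.of(2, 4, 3, 5), 0)); // true
        // Heuristic 2: one differing statement on each side, precision one.
        System.out.println(similar(List.of(8, 9, 6, 7), List.of(8, 6, 10, 7), 1)); // true
        System.out.println(similar(List.of(8, 9, 6, 7), List.of(8, 6, 10, 7), 0)); // false
    }
}
```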

  36. Ranking Heuristics
     • Heuristic 1: Higher frequency -> Higher rank
     • Heuristic 2: Shorter length -> Higher rank
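The two ranking heuristics can be combined into one comparator, as in the sketch below. The slide does not say how the heuristics interact, so applying frequency first and using length only as a tie-breaker is an assumption; the `RankingSketch` and `Candidate` names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative sketch only; the tie-breaking order is an assumption.
final class RankingSketch {

    /** A candidate sequence together with how often it was observed. */
    static final class Candidate {
        final List<String> calls;
        final int frequency;

        Candidate(List<String> calls, int frequency) {
            this.calls = calls;
            this.frequency = frequency;
        }
    }

    /** Sorts by frequency (descending), then by sequence length (ascending). */
    static List<Candidate> rank(List<Candidate> candidates) {
        List<Candidate> sorted = new ArrayList<>(candidates);
        sorted.sort(Comparator.comparingInt((Candidate c) -> c.frequency).reversed()
                .thenComparingInt(c -> c.calls.size()));
        return sorted;
    }

    public static void main(String[] args) {
        List<Candidate> ranked = rank(List.of(
                new Candidate(List.of("x", "y", "z", "w"), 23),
                new Candidate(List.of("x", "y"), 23),
                new Candidate(List.of("x", "y", "z"), 34)));
        for (Candidate c : ranked) {
            System.out.println(c.frequency + " " + c.calls);
        }
    }
}
```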

  38. Query Splitter
     • Lack of code samples in the results of code search engines
     • Code samples are split among different files
     Solution:
     • Split the user query into multiple queries
     • Compose the results for each split query

  39. Query Splitting Example
     1. User query: “org.eclipse.jface.viewers.IStructuredSelection->java.io.ObjectInputStream”
        Results: None
     2. Query: “java.io.ObjectInputStream”
        Results: 3. Most used sources are: java.io.InputStream, java.io.ByteArrayInputStream, java.io.FileInputStream
     3. Three queries to be fired:
        “org.eclipse.jface.viewers.IStructuredSelection-> java.io.InputStream” Results: 1
        “org.eclipse.jface.viewers.IStructuredSelection-> java.io.ByteArrayInputStream” Results: 5
        “org.eclipse.jface.viewers.IStructuredSelection-> java.io.FileInputStream” Results: None
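The split-and-compose idea in this example can be sketched as follows. Everything here is hypothetical: `search` stands in for whatever returns sequences for a "Source->Destination" query, `commonSourcesOf` stands in for the step that finds the most used sources of the destination type, and the composition simply joins a first-half result with a second-half result.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Illustrative sketch only; the two lookup functions are hypothetical stand-ins.
final class QuerySplitterSketch {

    /** Answers source->destination directly, or by composing split queries. */
    static List<String> answer(String source, String destination,
                               Function<String, List<String>> search,
                               Function<String, List<String>> commonSourcesOf) {
        List<String> direct = search.apply(source + "->" + destination);
        if (!direct.isEmpty()) {
            return direct;
        }
        List<String> composed = new ArrayList<>();
        for (String mid : commonSourcesOf.apply(destination)) {
            // Fire one split query per intermediate type and join matching halves.
            for (String firstHalf : search.apply(source + "->" + mid)) {
                for (String secondHalf : search.apply(mid + "->" + destination)) {
                    composed.add(firstHalf + " ; " + secondHalf);
                }
            }
        }
        return composed;
    }

    public static void main(String[] args) {
        // Toy index: only the two halves exist, the direct query has no results.
        Map<String, List<String>> index = Map.of(
                "Selection->InputStream", List.of("selectionToStream"),
                "InputStream->ObjectInputStream", List.of("new ObjectInputStream(in)"));
        List<String> result = answer("Selection", "ObjectInputStream",
                query -> index.getOrDefault(query, List.of()),
                type -> List.of("InputStream", "FileInputStream"));
        System.out.println(result);
    }
}
```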

  40. Eclipse Plugin

  41. Evaluations
     • Real Programming Problems: to address problems posted in developer forums
     • Real Projects: to show that solutions recommended by PARSEWeb are
       • available in real projects
       • on average better than solutions recommended by the related tools PROSPECTOR, Strathcona, and Google Code Search Engine

  42. Jakarta BCEL User Forum
     • Jakarta BCEL user forum, 2001
     Problem: “How to disassemble java byte code”
     Query: “Code -> Instruction”
     Solution Sample Code:
         Code code;
         InstructionList il = new InstructionList(code.getCode());
         Instruction[] ins = il.getInstructions();

  43. Dev2Dev Newsgroups
     • Dev2Dev Newsgroups, 2006
     Problem: “how to connect db by sessionBean”
     Query: javax.naming.InitialContext -> java.sql.Connection
     Solution Sequence:
         FileName:3_AddressBean.java MethodName:getNextUniqueKey Rank:1 NumberOfOccurrences:34
         javax.naming.InitialContext,lookup(java.lang.String) ReturnType:javax.sql.DataSource
         javax.sql.DataSource,getConnection() ReturnType:java.sql.Connection

  44. Challenges in Mining Code
     • Sometimes too few data samples
     • Scalability is usually not an issue
     • Static code bases vs. change histories
     • Data preparation/preprocessing
     • Related to traditional program analysis
     • Pattern postprocessing (filtering and ranking)
     • Heuristics play important roles
     • Demand-driven mining vs. any gold mining
     • Programming vs. bug finding

  45. Conclusion
     • Mining various types of software engineering data to aid software engineering tasks
     • Mining program source code to improve programmer productivity
     • MAPO: mining API usage patterns for a given API
     • Apiartor: mining API usage patterns for a given set of APIs
     • PARSEWeb: mining API usage patterns for input-output-type queries

  46. Questions?
     • Mining Software Engineering Data Bibliography: http://ase.csc.ncsu.edu/dmse/
     • What software engineering tasks can be helped by data mining?
     • What kinds of software engineering data can be mined?
     • How are data mining techniques used in software engineering?
     • Resources

  47. Demand-Driven Or Not

  48. Code vs. Non-Code

  49. Static vs. Dynamic

  50. Snapshot vs. Changes
