1 / 21

Mining Source Code Change History for Program Understanding

Mining Source Code Change History for Program Understanding. Chadd Williams. Problem. How much do you know about your 10 year old code base? What types of bugs have been most common? Implicit rules build up over time What do you do with a return value from a function?

melva
Download Presentation

Mining Source Code Change History for Program Understanding

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Mining Source Code Change Historyfor Program Understanding Chadd Williams

  2. Problem • How much do you know about your 10 year old code base? • What types of bugs have been most common? • Implicit rules build up over time • What do you do with a return value from a function? • Didn’t someone rewrite the matrix objects? • how do you apply a transformation to an image now? • Failure understand implicit rules leads to bugs • 32% of bugs detected during maintenance1 [1] Matsumura, T., Monden, A., Matsumoto, K., The Detection of Faulty Code Violating Implicit Coding Rules, IWPSE ’02

  3. Source Code Change History • We can discover important properties of the code by looking at code changes • every change is committed • changes highlight misunderstood code • changes highlight new code • Studying each commit gives fine-grain knowledge • how quickly does a property emerge? • how fast is a property adopted? • how often is it used later?

  4. Applications • Bug finding • what types of bugs have been fixed in the past? • what functions were involved? • Return Value Check Bug Finder • Code writing • how do we use that API? • how do we access that data structure? • Function Usage Pattern Miner open(f) tmp = cnt = 0 while(cnt < sz & tmp != -1) tmp = read(f,sz) if(tmp != -1) cnt += tmp close(f) open(f) tmp = cnt = 0 while(cnt < sz & tmp != -1) tmp = read(f,sz) if(tmp != -1) cnt += tmp close(f) open(f) tmp = cnt = 0 while(cnt < sz & tmp != -1) tmp = read(f,sz) if(tmp != -1) cnt += tmp close(f)

  5. Return Value Check Bug • Returning error code and valid data from a function is a common C idiom • the return value should be checked before being used • lint checks for this error • This type of bug pattern has a high false positive rate • no error value returned • no useful return value int foo(){ … if( error ){ return error_code; } … return data; } … value = foo(); newPosition + = value; // ??? • Build a bug checker • improve its results with data from CVS

  6. Goal • Which are most likely true errors • where has the source code been changed to add such a check? • look at each revision of each file in CVS • flag a function as involved in a return value check in the CVS repository • Produce a ranking of the errors • group warnings by called function • rank functions that most likely need their return value checked higher CVS commit value = foo(); if( value != Error) // Check newPosition + = value; value = foo(); newPosition + = value; // ???

  7. HistoryAware Ranking • Split functions into two groups • flagged with a likely bug fix in a commit • not flagged with a likely bug fix in a commit • Rank by how often the function’s return value is checked in the latest version • current context Flagged with likely bug fix in CVS Ranked by current context data Not flagged with likely bug fix in CVS Ranked by current context data

  8. Case Studies • Does the HistoryAware ranking push likely bugs to the top? • Compare HistoryAware Ranking to Naïve Ranking • current context • Inspection criteria for warnings • functions flagged with a bug fix in a commit • functions with return value checked >50% in current context Apache web server 1,129 C source files 41,000 CVS commits Wine, OSS Windows API 3,099 C source files 70,000 CVS commits

  9. Results - Apache • Statistical Significance • Chi-square test finds the difference between the false positive rate of the CVS bug fix flagged functions and functions check > 50% in the current context to be significant

  10. Results - Wine • Statistical Significance • Chi-square test finds the difference between the false positive rate of the CVS bug fix flagged functions and functions check > 50% in the current context to be significant

  11. HDC hdc = BeginPaint( hwnd, &ps ); if( hdc ) DrawIcon( hdc, x, y, hIcon ); EndPaint( hwnd, &ps ); Called After Function Usage Pattern Miner • System specific rules that source code must follow • Function Usage Pattern • how functions are invoked with respect to each other in the source code • Find new instances of patterns added to the source code mdi = HeapAlloc(GetProcessHeap()); if (!mdi) HeapFree(GetProcessHeap(), 0, cs); Conditionally Called After

  12. Our Tool • Analyze each revision of each file • record instances of the function usage patterns • Find new instances of the patterns • instances of a pattern in a revision of a file where that instance was not found in the revision immediately prior • per file, not per function

  13. Filtering • Lots of instances identified in the Wine software repository • 50 million • Preliminary filtering heuristic • only look at pairs of functions that are separated by no more than 10 source lines • minimal control flow information computed • many APIs contain functions that are called in quick succession • error handling code is close to the error producing function

  14. Transitive Patterns • called after may be a transitive pattern • only a binary pattern • allow larger patterns to be built • may need to add more context information 5 2 1 Patterns Identified 6 4 3

  15. Preliminary Case Study • Mined Wine CVS repository • 2,175 unique patterns added to the code 10 or more times • 65 unique patterns added 100 or more times • Different categories of function pairs • Debug functionality • Heap management • Paired functionality • Error Handling wine_tsx11_lock(); XInternAtoms(thr_dis(), names, cnt, 0, atoms ); wine_tsx11_unlock(); if (RegOpenKeyA(HKEY, name, &key)) { TRACE(message); RegCloseKey(key); SetLastError(NOT_FOUND);

  16. Called After Pattern • 1,253 unique patterns added 10 or more times • Obvious patterns • serves to validate our results • Surprising patterns • point to interesting relationships between functions wndClass.hCursor = LoadCursorA (0, (LPSTR)IDC_ARROW); RegisterClassA (&wndClass); RtlDeleteCriticalSection(&det->waiters_count_lock); … HeapFree(GetProcessHeap(), 0, det);

  17. Conditionally Called After • 922 unique patterns added 10 or more times • Error handling code • conditionally report error • which functions need errors handled • Debug code • conditionally call a debug function if (!(hModule = LoadLibraryExA(fileName, 0, LLDF))) WINE_ERR("LoadLibraryExA (%s) failed, %ld\n", fileName, GetLastError());

  18. RtlHeapFree Called After RtlHeapAlloc Value: 8 dlls/kernel/heap.c dlls/ntdll/loader.c

  19. Future Work • Apply our tool to more projects • Track removed usage patterns • Better filtering heuristic • control flow based • data flow based • How do we use the patterns we find? • documentation • feed patterns to static source code checkers to find violations hdc = BeginPaint( hwnd, &ps ); if( hdc ) DrawIcon( hdc, x, y, hIcon ); EndPaint( hwnd, &ps ); hdc = BeginPaint( hwnd, &ps ); if( hdc ) DrawIcon( hdc, x, y, hIcon ); EndPaint( hwnd, &ps );

  20. Demo • Demo of the visualization tool tomorrow

More Related