
K. Gondow (Titech, Japan) T. Suzuki (Elmic System Inc, Japan) H. Kawashima (JAIST, Japan)

Binary-Level Lightweight Data Integration to Develop Program Understanding Tools for Embedded Software in C. K. Gondow (Titech, Japan) T. Suzuki (Elmic System Inc, Japan) H. Kawashima (JAIST, Japan). Overview. Problems: Imprecision in C tools. High development cost of C tools.


Presentation Transcript


  1. Binary-Level Lightweight Data Integration to Develop Program Understanding Tools for Embedded Software in C K. Gondow (Titech, Japan) T. Suzuki (Elmic System Inc, Japan) H. Kawashima (JAIST, Japan) APSEC@BUSAN

  2. Overview • Problems: • Imprecision in C tools. • High development cost of C tools. • Our solution: • Binary-level lightweight data integration. • As a testbed, DWARF2 is used to develop: • dxref, rxref: cross-referencers • bscg: a call-graph extractor

  3. Imprecision in C tools (1/3) • e.g., GNU GLOBAL cannot distinguish a variable 'foo' from a label 'foo'. • Users must pick one entry from the candidate list. • This is because GNU GLOBAL only partially analyzes source code in order to run very fast. • Example (test.c): int main (void) { int foo; foo: goto foo; } • [Figure: clicking 'foo' shows a candidate list with two entries, "foo 3 test.c int foo;" and "foo 4 test.c foo: goto foo;"]

  4. Imprecision in C tools (2/3) • e.g., Murphy's study: • "An Empirical Study of Static Call Graph Extractors", by Murphy, et al., ICSE, 1996. • It reports that "call graphs extracted by several broadly distributed tools vary significantly enough to surprise many experienced software engineers."

  5. Imprecision in C tools (3/3) • Quantitative results for Mosaic, quoted from Murphy's paper. • [Figure: comparison of cflow∩Field, cflow-Field, and Field-cflow]

  6. Why imprecision? (1/2) • Reason #1: many tools only partially parse source code, resulting in incomplete analysis. • e.g., GNU GLOBAL, cxref, LXR, cscope, cflow... • At first glance, full parsing seems to solve this problem, but...

  7. Why imprecision? (2/2) • Reason #2: C source code is difficult to fully analyze because of • Compiler-specific extensions. • e.g., asm for inline assembly code • Ambiguous behaviors in the C standards. • undefined, unspecified, implementation-defined. • e.g., padding in a structure.

  8. Compiler-specific extensions • Essential in C and embedded software. • e.g., asm is used to obtain a H/W error code. • e.g., long long is used in C89's <stdio.h> • They make source code hard to analyze. • Different compilers have different semantics. • Example: void page_fault_handler (uint32_t error) { uint32_t cr2; asm volatile ("movl %%cr2,%0" : "=r"(cr2)); /* IA-32 control register #2 */ ... }

  9. Ambiguous behaviors in C (1/2) • Intentional and essential to keep C compilers fast and simple. • e.g., padding in a structure is an implementation-defined behavior. • This makes pointer analysis hard. • "Pointer analysis for programs with structures and casts", by Suan Hsi Yong, et al., PLDI'99.

  10. Ambiguous behaviors in C (2/2) • Example: struct S {char c; int *ip; } *p; struct T {char c; int i; } t; t.i = 0x1234; p = (struct S *)&t; printf ("%p\n", p->ip); • Different padding on different platforms. • To obtain precise dataflow, tools need to know the padding values used by the compiler. • But it is hard... • [Figure: memory layouts of struct S and struct T. On Solaris8 (32bit), struct S and struct T share the same layout (c, padding, then i or ip). On Solaris8 (64bit), struct S has different padding, so ip does not depend on i.]
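The layout difference on this slide can be checked with the standard offsetof macro. The following is a minimal sketch and not part of the original slides: the helper names padding_before_ip and ip_overlaps_i are hypothetical, and the values they return depend on the compiler and ABI in use.

```c
#include <assert.h>
#include <stddef.h>

struct S { char c; int *ip; };
struct T { char c; int i; };

/* Bytes of padding the compiler inserted between S.c and S.ip. */
size_t padding_before_ip(void)
{
    assert(offsetof(struct S, ip) >= sizeof(char));
    return offsetof(struct S, ip) - sizeof(char);
}

/* Returns 1 if the bytes occupied by S.ip overlap the bytes occupied by
   T.i when a struct T object is viewed through a struct S pointer,
   and 0 otherwise. */
int ip_overlaps_i(void)
{
    size_t ip_lo = offsetof(struct S, ip);
    size_t ip_hi = ip_lo + sizeof(int *);
    size_t i_lo  = offsetof(struct T, i);
    size_t i_hi  = i_lo + sizeof(int);
    return ip_lo < i_hi && i_lo < ip_hi;
}
```

On a typical 32-bit ABI both members sit at offset 4, so ip_overlaps_i() would return 1 and p->ip reads the value stored in t.i. On a typical 64-bit (LP64) ABI, ip is usually aligned to offset 8 and the function would return 0. A source-level tool cannot know which case applies without the compiler's layout rules.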

  11. Possible solutions • Modify compilers (e.g., GCC) to emit their internal analysis data. • Seemingly high development cost. • Many compilers would need to be modified. • Use the binary information in executables emitted by compilers. • Relatively easy, although it lacks some information, e.g., statements.

  12. Our solution and result • Our solution: • Uses DWARF2 debugging information as the binary information. • Preliminary experiment: • Good results for our cross-referencers and call-graph extractor. • Better precision, although: • false negatives increased somewhat. • quantitative results have not yet been obtained.

  13. Demonstration • Using DWARF2, we implemented: • two cross-referencers: • dxref: only uses DWARF2 • Sample output: dxref • rxref: hybrid of dxref and GNU GLOBAL • Sample output: dxref • a static call-graph extractor: • bscg: uses DWARF2 and disassembler. • Sample outputs: fact, dxref, bash, bash

  14. DWARF2-XML • [Figure: C code is compiled into an ELF/DWARF2 binary (text, data, symbol info., relocation info., debug info.). Data integration extracts this information into the common format DWARF2-XML, which is used by dxref and rxref (cross-referencers) and bscg (call graph extractor).]

  15. How bscg works • (1) Extract call instructions by disassembling the text section. • (2) Convert addresses to symbols using DWARF2, e.g., "1234: call 5678" becomes "main: call fact". • (3) Trim call graphs according to options. • (4) Output the graph topology in the DOT language of Graphviz, e.g., digraph G { main -> fact; fact -> fact; } • [Figure: rendered graph with nodes main and fact]
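Steps (2) to (4) can be sketched in C as follows. This is only an illustration under stated assumptions, not bscg's actual code: the types func_sym and call_site, the functions addr_to_name and emit_dot, and the single-name exclude option are all hypothetical. The address ranges stand in for what DWARF2 records as DW_AT_low_pc and DW_AT_high_pc.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical symbol record: a function name and its address range. */
struct func_sym {
    const char *name;
    unsigned long low_pc;   /* first address of the function */
    unsigned long high_pc;  /* first address past the function */
};

/* Hypothetical call site found by disassembling the text section. */
struct call_site {
    unsigned long insn_addr;    /* address of the call instruction */
    unsigned long target_addr;  /* address being called */
};

/* Step (2): map an address to the enclosing function name, or NULL. */
static const char *addr_to_name(const struct func_sym *syms, size_t nsyms,
                                unsigned long addr)
{
    for (size_t i = 0; i < nsyms; i++)
        if (addr >= syms[i].low_pc && addr < syms[i].high_pc)
            return syms[i].name;
    return NULL;
}

/* Steps (3) and (4): write the call graph as DOT text into buf.
   Calls that cannot be resolved, and calls to the function named by
   exclude (may be NULL), are skipped. Returns the number of edges
   written, or -1 if buf is too small. */
int emit_dot(char *buf, size_t cap,
             const struct func_sym *syms, size_t nsyms,
             const struct call_site *calls, size_t ncalls,
             const char *exclude)
{
    assert(buf != NULL && cap > 0);
    int edges = 0;
    int n = snprintf(buf, cap, "digraph G {\n");
    if (n < 0 || (size_t)n >= cap) return -1;
    size_t used = (size_t)n;

    for (size_t i = 0; i < ncalls; i++) {
        const char *caller = addr_to_name(syms, nsyms, calls[i].insn_addr);
        const char *callee = addr_to_name(syms, nsyms, calls[i].target_addr);
        if (caller == NULL || callee == NULL) continue;
        if (exclude != NULL && strcmp(callee, exclude) == 0) continue;
        n = snprintf(buf + used, cap - used, "  %s -> %s;\n", caller, callee);
        if (n < 0 || (size_t)n >= cap - used) return -1;
        used += (size_t)n;
        edges++;
    }

    n = snprintf(buf + used, cap - used, "}\n");
    if (n < 0 || (size_t)n >= cap - used) return -1;
    return edges;
}
```

With a symbol table in which main covers address 0x1234 and fact starts at 0x5678, the call site "1234: call 5678" from the slide is written as the edge main -> fact. The linear search keeps the sketch short; a real tool would more likely sort the ranges and use binary search.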

  16. Advantages of bscg • Advantages of binary-level DI (explained later). • e.g., high applicability and few false positives. • Can identify inlined functions. • Can extract a call from asm ("call fact"); • Can exclude • library functions: e.g., printf • system calls: e.g., open, fork • functions in runtime systems: _start, _fini

  17. Disadvantages of bscg • No support for macro calls, signals, function pointers, or optimization. • gprof-callgraph.pl can handle function pointers, since it uses dynamic information. • Source-level tools (e.g., cflow) do not suffer from the optimization problem.

  18. So, is bscg good? • Yes! (not the best, of course) • Not easy to compare.

  19. What is binary-level DI? • Provides common formats by extracting information from binary code. • [Figure: source code (*.c) is compiled into binary code (a.out). Source DI analyzes the source code and binary DI analyzes the binary code; both produce common formats (e.g., DWARF2-XML) that tools consume.]

  20. Why binary-level DI? • Many advantages: • High applicability • Few false positives. • More true positives for low-level info. • Low development cost • Can improve the precision of C tools.

  21. What is lightweight DI? • Allows several common formats. • To be practical! Hard to perfectly integrate. • [Figure: heavyweight DI vs. lightweight DI; DWARF2-XML]

  22. Summary • Imprecision in C tools. • Our solution: • Binary-level lightweight data integration. • As a testbed, DWARF2 is used to develop: • dxref, rxref: cross-referencers • bscg: call-graph extractor

  23. Future work • Apply our technique to other tools: • e.g., memory profilers, slicers, test coverage tools, ... • Develop new binary formats suitable for lower CASE tools. • tool-information carrying code. • cf. proof-carrying code, model-carrying code, schedule-carrying code.


  25. Taxonomy of cross-referencers • Source-level • Partial-parsing: GNU GLOBAL, LXR, ... • Full-parsing: Sapid, ACML • Binary-level • Symbol tables: Visual Studio .NET(?) • Debug info.: dxref • Hybrid: rxref

  26. What is DWARF2? • A binary format for debugging information. • Primary target languages: • C, C++, Fortran, Modula2, Pascal. • Includes: • types, nested blocks, line numbers, function/object names, addresses, stack frame information, ...

  27. DWARF2-XML • Our common format in XML for DWARF2. • A testbed of binary-level lightweight DI. • Makes it easier to process DWARF2. • cf. libdwarf • About 15 times larger than DWARF2.

  28. DWARF2-XML example • Source: { int i; ... } • DWARF2-XML: <section name=".debug_info"> <tag name="DW_TAG_lexical_block" offset="id:27"> <attribute name="DW_AT_low_pc" value="67328"/> <attribute name="DW_AT_high_pc" value="67356"/> ... <tag name="DW_TAG_variable" offset="id:27"> <attribute name="DW_AT_name" value="i"/> <attribute name="DW_AT_type" value_ref="id:161"> <attribute name="DW_AT_location"> <description>DW_OP_fbreg: -24</description></></></></> ... <tag name="DW_TAG_base_type" offset="id:161"> <attribute name="DW_AT_name" value="int"/> <attribute name="DW_AT_byte_size" value="4"/> <attribute name="DW_AT_encoding" value="5"> <description>signed</description></></></> • [Callouts: address range (DW_AT_low_pc, DW_AT_high_pc); variable name (DW_AT_name); ID/IDREF link (value_ref="id:161"); offset to base pointer (DW_OP_fbreg: -24)]

  29. DWARF2-XML file sizes • About 15 times larger than DWARF2. • The size increase is almost cancelled by gzip. • Consumes much memory when using DOM. • e.g., we cannot build a DOM tree for gdb (about 400,000 LOC) in our environment. • Tradeoff between memory consumption and low development cost.

  30. Execution speed • bscg is slower than the other tools, but acceptable for practical use. • 12000 lines in 8.8 sec. • However, it is far too slow in the case of bash-2.03. • bscg has a scalability problem due to the heavy overhead of the DOM library.

  31. Why XML? • Highly readable, portable, interoperable. • Plain text and self-descriptive. • Powerful enough to describe complex structures and relations in programs. • Nested tags and ID/IDREF links. • DTD for checking XML documents. • Flexibility to process semi-structured documents. • Easy to query/display/modify. • XML parsers, DOM/SAX, XPath. • An XPath expression is much shorter than tedious tree-traversal code.

  32. Drawbacks in API integration (e.g., libdwarf) • Insufficient abstraction. • The many and varied data structures and access patterns make it hard to encapsulate them well in a fixed API. • e.g., libdwarf offers a poor API for traversing a wide range of the data tree (only dwarf_siblingof and dwarf_child are provided). • High cost to implement the API in many languages. • High cost to learn how to use the API.

  33. false/true positive/negative • false positives • the tool's incorrect output. • true positives • the tool's correct output. • false negatives • the tool's incorrect silence. • the tool should have produced output, but did not. • true negatives • the tool's correct silence. • the tool should not have produced output, and did not.
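The four cases can be written down as a small classification function. This is an illustrative sketch only; the names outcome, classify, and tally are hypothetical and do not come from the authors' tools. Here "reported" means the tool produced an output (for example, a call edge), and "actual" means the output is correct according to some ground truth.

```c
#include <assert.h>
#include <stddef.h>

enum outcome { TRUE_POSITIVE, FALSE_POSITIVE, FALSE_NEGATIVE, TRUE_NEGATIVE };

/* Classify one candidate item from what the tool did (reported) and
   what it should have done (actual). Nonzero means "yes". */
enum outcome classify(int reported, int actual)
{
    if (reported)
        return actual ? TRUE_POSITIVE : FALSE_POSITIVE;
    return actual ? FALSE_NEGATIVE : TRUE_NEGATIVE;
}

/* Tally the outcomes of n candidate items into counts[4],
   indexed by enum outcome. */
void tally(const int *reported, const int *actual, size_t n, size_t counts[4])
{
    assert(reported != NULL && actual != NULL && counts != NULL);
    for (int k = 0; k < 4; k++)
        counts[k] = 0;
    for (size_t i = 0; i < n; i++)
        counts[classify(reported[i], actual[i])]++;
}
```

For instance, a call-graph extractor that stays silent about an edge that really exists corresponds to classify(0, 1), a false negative.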

  34. bscg's graph trimming options

  35. Why lightweight DI? • To be practical! Hard to perfectly integrate. • Supported by the fact that most technologies have given up on perfect integration/definition. • e.g., undefined behaviors in C. • e.g., GNU BFD provides an API that integrates different binary formats. • useful, but not perfect. • cannot convert ELF/DWARF2 into Windows PE.

  36. Why is function pointer analysis difficult in C? • Pointer arithmetic and casting. • e.g., (int (*)())(base + offset) • Dynamic libraries • e.g., handle = dlopen (libname, RTLD_LAZY); func = dlsym (handle, funcname); func (); • Inline assembly code • e.g., asm ("call foo");

  37. CASE tools development cost • Generally very high. • individual parsers & analyzers. • internal data is less interoperable and portable • IBM Eclipse • $40,000,000 (?)

  38. E.g., function pointer • Example: int add5 (int x) { return x + 5; } int apply (int (*f)(int), int x) { return f (x); } int main (void) { return apply (add5, 10); } • cflow • apply calls f (false positive) • gprof-callgraph.pl • apply calls add5 (true positive) • Other tools (bscg) • apply calls ? (false negative)

  39. Our homepage • http://www.sde.cs.titech.ac.jp/~gondow/dwarf2-xml/ • DTD for DWARF2-XML • Source code of readelf+, dxref, rxref, bscg • Some sample outputs
