K. Gondow (Titech, Japan) T. Suzuki (Elmic System Inc, Japan) H. Kawashima (JAIST, Japan)

Binary-Level Lightweight Data Integration to Develop Program Understanding Tools for Embedded Software in C
K. Gondow (Titech, Japan) T. Suzuki (Elmic System Inc, Japan) H. Kawashima (JAIST, Japan)
APSEC@BUSAN

Overview
• Problems:
  • Imprecision in C tools.
  • High development cost of C tools.
• Our solution:
  • Binary-level lightweight data integration.
  • As a testbed, DWARF2 is used to develop:
    • dxref, rxref: cross-referencers
    • bscg: a call-graph extractor

Imprecision in C tools (1/3)
• e.g., GNU GLOBAL cannot distinguish a variable 'foo' from a label 'foo'.
  • Users must pick one from a candidate list.
  • This is because GNU GLOBAL only partially analyzes source code in order to run very fast.

int main (void)
{
    int foo;
    foo: goto foo;
}

Candidate list shown after clicking foo:
  foo 3 test.c int foo;
  foo 4 test.c foo: goto foo;

Imprecision in C tools (2/3)
• e.g., Murphy's study:
  • "An Empirical Study of Static Call Graph Extractors", by Murphy, et al., ICSE, 1996.
  • It reports that "call graphs extracted by several broadly distributed tools vary significantly enough to surprise many experienced software engineers."

Imprecision in C tools (3/3)
• Quantitative results for mosaic, quoted from Murphy's paper.
[Figure: comparison of the call graphs extracted by cflow and Field, with regions cflow∩Field, cflow-Field, and Field-cflow.]

Why imprecision? (1/2)
• Reason #1: many tools only partially parse source code, resulting in incomplete analysis.
  • e.g., GNU GLOBAL, cxref, LXR, cscope, cflow, ...
• At first glance, full parsing seems to solve this problem, but...

Why imprecision? (2/2)
• Reason #2: C source code is difficult to analyze fully because of:
  • Compiler-specific extensions.
    • e.g., asm for inline assembly code.
  • Ambiguous behaviors in the C standards.
    • undefined, unspecified, implementation-defined.
    • e.g., padding in a structure.

Compiler-specific extensions
• Essential in C and embedded software.
  • e.g., asm is used to obtain a H/W error code.
  • e.g., long long is used in C89's <stdio.h>
• They make source code hard to analyze.
  • Different compilers give them different semantics.

void page_fault_handler (uint32_t error)
{
    uint32_t cr2;
    asm volatile ("movl %%cr2,%0":"=r"(cr2));  /* IA-32 control register #2 */
    ...
}

Ambiguous behaviors in C (1/2)
• Intentional and essential to keep C compilers fast and simple.
• e.g., padding in a structure is implementation-defined behavior.
• This makes pointer analysis hard.
  • "Pointer analysis for programs with structures and casts", by Suan Hsi Yong, et al., PLDI'99.

Ambiguous behaviors in C (2/2)

struct S {char c; int *ip; } *p;
struct T {char c; int i; } t;
t.i = 0x1234;
p = (struct S *)&t;
printf ("%p\n", p->ip);

• Padding differs from platform to platform.
• To obtain precise dataflow, tools need to know the padding chosen by the compiler.
• But this is hard...
[Figure: member layouts (c, padding, i, ip) of struct S and struct T on Solaris8 (32bit) and Solaris8 (64bit); one case is annotated "not depends on".]
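The padding question above can be made concrete with offsetof. The following is a minimal sketch, not part of the original slides; the helper names padding_in_S, padding_in_T, and ip_aliases_i are hypothetical. It reports how many padding bytes the current compiler placed after c, and whether p->ip would start at the same byte as t.i.

```c
#include <stddef.h>  /* offsetof, size_t */

struct S { char c; int *ip; };
struct T { char c; int  i;  };

/* Padding bytes the compiler inserted between c and ip in struct S. */
size_t padding_in_S(void) { return offsetof(struct S, ip) - sizeof(char); }

/* Padding bytes the compiler inserted between c and i in struct T. */
size_t padding_in_T(void) { return offsetof(struct T, i) - sizeof(char); }

/* Returns 1 when ip (seen through a struct S view) starts at the same
 * byte as i in struct T, 0 otherwise. The answer is chosen by the
 * compiler and ABI, not by the C standard. */
int ip_aliases_i(void)
{
    return offsetof(struct S, ip) == offsetof(struct T, i);
}
```

On a typical 32-bit ABI both members sit at offset 4, so ip_aliases_i() returns 1; on a typical LP64 ABI ip sits at offset 8 while i sits at offset 4, so it returns 0. A source-level tool cannot know which case applies without this compiler-specific knowledge.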

Possible solutions
• Modify compilers (e.g., GCC) to emit their internal analysis data.
  • Seemingly high development cost.
  • Many compilers would have to be modified.
• Use binary information in the executables emitted by compilers.
  • Relatively easy, although some information is missing, e.g., statements.

Our solution and result
• Our solution:
  • Uses DWARF2 debugging information as the binary information.
• Preliminary experiment:
  • Good results for our cross-referencers and call-graph extractor.
  • Better precision, although:
    • false negatives increased somewhat.
    • quantitative results have not yet been obtained.

Demonstration
• Using DWARF2, we implemented:
  • two cross-referencers:
    • dxref: uses only DWARF2
      • Sample output: dxref
    • rxref: hybrid of dxref and GNU GLOBAL
      • Sample output: dxref
  • a static call-graph extractor:
    • bscg: uses DWARF2 and a disassembler.
      • Sample outputs: fact, dxref, bash, bash

DWARF2-XML
[Figure: C code is compiled into an ELF/DWARF2 binary (text, data, symbol info., relocation info., debug info.); data integration extracts it into the common format DWARF2-XML, which is used by dxref and rxref (cross-referencers) and by bscg (call graph extractor).]

How bscg works
(1) extract call instructions by disassembling text.
(2) convert addresses to symbols using DWARF2.
      1234: call 5678  →  main: call fact
(3) trim call graphs according to options.
(4) output graph topology in DOT of Graphviz.

digraph G {
  main -> fact;
  fact -> fact;
}

[Figure: rendered graph with nodes main and fact; link: usage.]
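Steps (2) to (4) above can be sketched in a few lines of C. This is an illustration under stated assumptions, not bscg's actual code: the types struct sym and struct call and the functions resolve, excluded, and emit_dot are hypothetical, the address ranges are assumed to come from debug information (low/high PC), and trimming is reduced to a plain exclusion list.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical symbol entry: a function's address range [low, high). */
struct sym  { unsigned long low, high; const char *name; };

/* One extracted call: address of the call instruction and its target. */
struct call { unsigned long site, target; };

/* Step (2): map an address to the enclosing function name, or NULL. */
static const char *resolve(const struct sym *syms, size_t nsyms,
                           unsigned long addr)
{
    for (size_t i = 0; i < nsyms; i++)
        if (syms[i].low <= addr && addr < syms[i].high)
            return syms[i].name;
    return NULL;
}

/* Step (3): return 1 if name appears in the exclusion list. */
static int excluded(const char *const *skip, size_t nskip, const char *name)
{
    for (size_t i = 0; i < nskip; i++)
        if (strcmp(skip[i], name) == 0)
            return 1;
    return 0;
}

/* Step (4): write the graph in DOT into buf; returns the edge count.
 * Output is cut at the last complete line if buf is too small. */
size_t emit_dot(char *buf, size_t cap,
                const struct sym *syms, size_t nsyms,
                const struct call *calls, size_t ncalls,
                const char *const *skip, size_t nskip)
{
    size_t len, edges = 0;
    int n = snprintf(buf, cap, "digraph G {\n");
    if (n < 0 || (size_t)n >= cap)
        return 0;
    len = (size_t)n;
    for (size_t i = 0; i < ncalls; i++) {
        const char *from = resolve(syms, nsyms, calls[i].site);
        const char *to   = resolve(syms, nsyms, calls[i].target);
        if (!from || !to)
            continue;                       /* unresolved address */
        if (excluded(skip, nskip, from) || excluded(skip, nskip, to))
            continue;                       /* trimmed by option */
        n = snprintf(buf + len, cap - len, "  %s -> %s;\n", from, to);
        if (n < 0 || (size_t)n >= cap - len) {
            buf[len] = '\0';                /* drop the partial line */
            return edges;
        }
        len += (size_t)n;
        edges++;
    }
    n = snprintf(buf + len, cap - len, "}\n");
    if (n < 0 || (size_t)n >= cap - len)
        buf[len] = '\0';
    return edges;
}
```

With main covering the call site 1234 and fact starting at 5678, the call "1234: call 5678" becomes the edge main -> fact, as on the slide. Resolving call sites by range rather than by exact address is a design choice of this sketch; it lets the address of any instruction inside a function map to that function.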

Advantages of bscg
• Advantages of binary-level DI (explained later).
  • e.g., high applicability and few false positives.
• Can identify inlined functions.
• Can extract a call from asm ("call fact");
• Can exclude:
  • library functions: e.g., printf
  • system calls: e.g., open, fork
  • functions in runtime systems: e.g., _start, _fini

Disadvantages of bscg
• No support for macro calls, signals, function pointers, or optimization.
  • gprof-callgraph.pl can handle function pointers, since it uses dynamic information.
  • Source-level tools (e.g., cflow) do not suffer from the optimization problem.

So, is bscg good?
• Yes! (not the best, of course)
• Not easy to compare.

What is binary-level DI?
• Provides common formats by extracting information from binary code.
[Figure: source code (*.c) is compiled into binary code (a.out); source DI analyzes the source code and binary DI analyzes the binary code, each producing common formats (e.g., DWARF2-XML) for tools.]

Why binary-level DI?
• Many advantages:
  • High applicability.
  • Few false positives.
  • More true positives for low-level info.
  • Low development cost.
• Can improve the precision of C tools.

What is lightweight DI?
• Allows several common formats.
• To be practical! Perfect integration is hard.
[Figure: heavyweight DI versus lightweight DI, with DWARF2-XML as one common format.]

Summary
• Imprecision in C tools.
• Our solution:
  • Binary-level lightweight data integration.
  • As a testbed, DWARF2 is used to develop:
    • dxref, rxref: cross-referencers
    • bscg: call-graph extractor

Future work
• Apply our technique to other tools:
  • e.g., memory profilers, slicers, test coverage tools, ...
• Develop new binary formats suitable for lower CASE tools.
  • tool-information carrying code.
  • cf. proof-carrying code, model-carrying code, schedule-carrying code.

Taxonomy of cross-referencers
• Source-level
  • Partial parsing: GNU GLOBAL, LXR, ...
  • Full parsing: Sapid, ACML
• Binary-level
  • Symbol tables: Visual Studio .NET(?)
  • Debug info.: dxref
• Hybrid: rxref

What is DWARF2?
• A binary format for debugging information.
• Primary target languages:
  • C, C++, Fortran, Modula2, Pascal.
• Includes:
  • types, nested blocks, line numbers, function/object names, addresses, stack frame information, ...

DWARF2-XML
• Our common format in XML for DWARF2.
• A testbed of binary-level lightweight DI.
• Makes it easier to process DWARF2.
  • cf. libdwarf
• About 15 times larger than DWARF2.

DWARF2-XML example

{ int i; ... }

<section name=".debug_info">
  <tag name="DW_TAG_lexical_block" offset="id:27">
    <attribute name="DW_AT_low_pc" value="67328"/>
    <attribute name="DW_AT_high_pc" value="67356"/>
    ...
    <tag name="DW_TAG_variable" offset="id:27">
      <attribute name="DW_AT_name" value="i"/>
      <attribute name="DW_AT_type" value_ref="id:161">
      <attribute name="DW_AT_location">
        <description>DW_OP_fbreg: -24</description></></></></>
  ...
  <tag name="DW_TAG_base_type" offset="id:161">
    <attribute name="DW_AT_name" value="int"/>
    <attribute name="DW_AT_byte_size" value="4"/>
    <attribute name="DW_AT_encoding" value="5">
      <description>signed</description></></></>

Annotations: DW_AT_low_pc and DW_AT_high_pc give the address range; DW_AT_name gives the variable name; value_ref="id:161" is an ID/IDREF link to the base type; DW_OP_fbreg: -24 is the offset to the base ptr.

DWARF2-XML file sizes
• About 15 times larger than DWARF2.
• The size increase is almost cancelled by gzip.
• Consumes much memory when using DOM.
  • e.g., we cannot build a DOM tree for gdb in our environment (gdb's LOC is about 400,000).
• Tradeoff between memory consumption and low development cost.

Execution speed
• bscg is slower than the other tools, but acceptable for practical use.
  • 12000 lines in 8.8 sec.
  • but too slow in the case of bash-2.03.
• bscg has a scalability problem due to the heavy overhead of the DOM library.

Why XML?
• Highly readable, portable, interoperable.
  • Plain text and self-descriptive.
• Powerful enough to describe complex structures and relations in programs.
  • Nested tags and ID/IDREF links.
  • DTD for checking XML documents.
  • Flexibility to process semi-structured documents.
• Easy to query/display/modify.
  • XML parsers, DOM/SAX, XPath.
  • An XPath expression is much shorter than tedious tree-traversal code.

Drawbacks in API integration (e.g., libdwarf)
• Insufficient abstraction.
  • The many and varied data structures and access patterns make it hard to encapsulate them well in a fixed API.
  • e.g., libdwarf's API for traversing a wide range of the data tree is poor (only dwarf_siblingof and dwarf_child are provided).
• High cost to implement the API in many languages.
• High cost to learn how to use the API.
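To illustrate the traversal burden, here is a minimal sketch of what a child/sibling-only interface forces a tool to write. It does not call libdwarf: struct die is a hypothetical in-memory stand-in whose child and sibling pointers play the roles of dwarf_child and dwarf_siblingof, and count_tag is a hypothetical helper.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for one debugging-information entry. */
struct die {
    const char *tag;       /* e.g. "DW_TAG_variable" */
    const char *name;      /* value of DW_AT_name, or NULL */
    struct die *child;     /* first child  (role of dwarf_child) */
    struct die *sibling;   /* next sibling (role of dwarf_siblingof) */
};

/* Count entries with the given tag in the subtree rooted at d and in
 * all of d's following siblings. With only child/sibling access, every
 * query needs a hand-written recursive walk like this one. */
size_t count_tag(const struct die *d, const char *tag)
{
    size_t n = 0;
    for (; d != NULL; d = d->sibling) {
        if (strcmp(d->tag, tag) == 0)
            n++;
        n += count_tag(d->child, tag);
    }
    return n;
}
```

Assuming the element and attribute names used in the DWARF2-XML example slide, the same query over DWARF2-XML would be a single XPath expression such as //tag[@name='DW_TAG_variable'].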

false/true positive/negative
• false positives
  • the tool's incorrect output.
• true positives
  • the tool's correct output.
• false negatives
  • the tool's incorrect silence.
  • the tool should have produced output, but did not.
• true negatives
  • the tool's correct silence.
  • the tool should not have produced output, and did not.
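The four cases above reduce to two yes/no questions: did the tool report the item, and does the item really exist? A minimal sketch follows; the enum outcome and the function classify are hypothetical names used only for illustration.

```c
/* The four possible outcomes for one candidate item (e.g. a call edge). */
enum outcome { TRUE_POSITIVE, FALSE_POSITIVE, FALSE_NEGATIVE, TRUE_NEGATIVE };

/* reported: nonzero if the tool output the item.
 * actual:   nonzero if the item really exists in the program. */
enum outcome classify(int reported, int actual)
{
    if (reported)
        return actual ? TRUE_POSITIVE : FALSE_POSITIVE;
    return actual ? FALSE_NEGATIVE : TRUE_NEGATIVE;
}
```

For a call edge that exists but is missing from a tool's output, classify(0, 1) yields FALSE_NEGATIVE; for an edge a tool reports that does not exist, classify(1, 0) yields FALSE_POSITIVE.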

bscg's graph trimming options

Why lightweight DI?
• To be practical! Perfect integration is hard.
• Supported by the fact that most technologies have given up on perfect integration/definition.
  • e.g., undefined behaviors in C.
  • e.g., GNU BFD provides an API that integrates different binary formats.
    • useful, but not perfect.
    • cannot convert ELF/DWARF2 into Windows PE.

Why is function pointer analysis difficult in C?
• Pointer arithmetic and casting.
  • e.g., (int (*)())(base + offset)
• Dynamic libraries.
  • e.g., handle = dlopen (libname, RTLD_LAZY); func = dlsym (handle, funcname); func ();
• Inline assembly code.
  • e.g., asm ("call foo");

CASE tools development cost
• Generally very high.
  • individual parsers & analyzers.
  • internal data is less interoperable and portable.
• IBM Eclipse
  • $40,000,000 (?)

E.g., function pointer
• cflow
  • apply calls f (false positive)
• gprof-callgraph.pl
  • apply calls add5 (true positive)
• Other tools (bscg)
  • apply calls ? (false negative)

int add5 (int x) { return x + 5; }
int apply (int (*f)(int), int x) { return f (x); }
int main (void) { return apply (add5, 10); }

Our homepage
• http://www.sde.cs.titech.ac.jp/~gondow/dwarf2-xml/
  • DTD for DWARF2-XML
  • Source code of readelf+, dxref, rxref, bscg
  • Some sample outputs
