
Scalable Support for Multithreaded Applications on Dynamic Binary Instrumentation Systems


Presentation Transcript


  1. Scalable Support for Multithreaded Applications on Dynamic Binary Instrumentation Systems Kim Hazelwood Greg Lueck Robert Cohn

  2. Dynamic Binary Instrumentation • Inserts or modifies arbitrary instructions in executing binaries, e.g., an instruction count • [Figure: counter++ is inserted before each instruction of an x86 sequence: sub $0xff, %edx; cmp %esi, %edx; jle <L1>; mov $0x1, %edi; add $0x10, %eax]

  3. Instruction Count Output • $ /bin/ls Makefile imageload.out itrace proccount imageload inscount atrace itrace.out • $ pin -t inscount.so -- /bin/ls Makefile imageload.out itrace proccount imageload inscount atrace itrace.out • Count 422838
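The count shown above comes from a simple instruction-counting Pintool. As a point of reference, here is a minimal sketch in the style of the classic inscount example shipped with the Pin kit; it uses the public Pin API (INS_AddInstrumentFunction, INS_InsertCall), but it is an illustration, not the exact tool used in the talk.

```cpp
// Minimal instruction-counting Pintool, sketched after Pin's "inscount" example.
#include "pin.H"
#include <iostream>

static UINT64 icount = 0;

// Analysis routine: executed before every instruction of the application.
VOID DoCount() { icount++; }

// Instrumentation routine: called once per instruction when Pin first
// translates it; inserts a call to DoCount before the instruction.
VOID Instruction(INS ins, VOID *v) {
    INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR)DoCount, IARG_END);
}

// Called when the application exits.
VOID Fini(INT32 code, VOID *v) {
    std::cerr << "Count " << icount << std::endl;
}

int main(int argc, char *argv[]) {
    PIN_Init(argc, argv);
    INS_AddInstrumentFunction(Instruction, 0);
    PIN_AddFiniFunction(Fini, 0);
    PIN_StartProgram();   // never returns
    return 0;
}
```

Built as a shared object, it runs exactly as shown on the slide: pin -t inscount.so -- /bin/ls. Note that the single global icount is precisely the kind of shared state the rest of the talk is about: with multiple application threads it needs either atomic updates or per-thread counters.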

  4. How Does it Work? • Generates and caches modified copies of instructions • Modified (cached) instructions are executed in lieu of original instructions • [Figure: EXE → Transform → Profile → Code Cache → Execute pipeline]

  5. Why “Dynamic” Instrumentation? • Robustness! • No need to recompile or relink • Discover code at runtime • Handle dynamically-generated code • Attach to running processes • [Figure: the code discovery problem on x86 – an instruction stream (Instr 1–3, Jump Reg, DATA, Instr 5–6, Uncond Branch, PADDING, Instr 8) with an indirect jump to an unknown target, data interspersed with code, and padding for alignment]

  6. Intel Pin • A dynamic binary instrumentation system • Easy-to-use instrumentation interface • Supports multiple platforms • Four ISAs – IA32, Intel64, IPF, ARM • Four OSes – Linux, Windows, FreeBSD, MacOS • Popular and well supported • 32,000+ downloads • 400+ citations • 500+ mailing list subscribers

  7. Research Applications • Gather profile information about applications • Compare programs generated by competing compilers • Generate a select stream of live information for event-driven simulation • Add security features • Emulate new hardware • Anything and everything multicore

  8. The Problem with Modern Tools • Many research tools do not support multithreaded guest applications • Providing support for MT apps is mostly straightforward • Providing scalable support can be tricky!

  9. Issues that Arise • Gaining control of executing threads • Determining what should be private vs. shared between threads • Code cache maintenance and consistency • Concurrent instruction writes • Providing/handling thread-local storage • Handling indirect branches • Handling signals / system calls
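Several of these issues (private vs. shared data, thread-local storage) show up even in the trivial instruction counter above. A hedged sketch of the usual remedy, in the spirit of the inscount_tls example shipped with Pin, keeps one cache-line-padded counter per thread and lets Pin pass the thread ID to the analysis routine via IARG_THREAD_ID; MAX_THREADS and the padding size below are our own illustrative choices, not values from the paper.

```cpp
// Per-thread instruction counting, sketched after Pin's inscount_tls example.
// MAX_THREADS and PADSIZE are illustrative; thread IDs are assumed to stay
// below MAX_THREADS.
#include "pin.H"
#include <iostream>

static const UINT32 MAX_THREADS = 1024;
static const UINT32 PADSIZE     = 64;   // one cache line, to avoid false sharing

struct ThreadCounter {
    UINT64 count;
    UINT8  pad[PADSIZE - sizeof(UINT64)];
};

static ThreadCounter counters[MAX_THREADS];

// Analysis routine: each thread increments only its own padded slot,
// so there is no shared counter and no lock on the hot path.
VOID PIN_FAST_ANALYSIS_CALL DoCount(THREADID tid, UINT32 ninst) {
    counters[tid].count += ninst;
}

// Instrumentation routine: one analysis call per basic block (BB granularity).
VOID Trace(TRACE trace, VOID *v) {
    for (BBL bbl = TRACE_BblHead(trace); BBL_Valid(bbl); bbl = BBL_Next(bbl)) {
        BBL_InsertCall(bbl, IPOINT_BEFORE, (AFUNPTR)DoCount,
                       IARG_FAST_ANALYSIS_CALL, IARG_THREAD_ID,
                       IARG_UINT32, BBL_NumIns(bbl), IARG_END);
    }
}

VOID Fini(INT32 code, VOID *v) {
    UINT64 total = 0;
    for (UINT32 i = 0; i < MAX_THREADS; i++) total += counters[i].count;
    std::cerr << "Count " << total << std::endl;
}

int main(int argc, char *argv[]) {
    PIN_Init(argc, argv);
    TRACE_AddInstrumentFunction(Trace, 0);
    PIN_AddFiniFunction(Fini, 0);
    PIN_StartProgram();
    return 0;
}
```

Padding each counter to a cache line keeps threads from bouncing the same line between cores, which is the difference between "supports multithreaded apps" and "scales with them".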

  10. The Pin Architecture • [Figure: a Pintool (instrumentation code, call-back handlers, analysis code) sits on top of Pin; application threads T1 and T2 flow through Pin's JIT compiler, dispatcher, syscall emulator, and signal emulator and execute out of the code cache, with the figure marking which components are serialized and which run in parallel]

  11. Code Cache Consistency • Cached code must be removed for a variety of reasons: • Dynamically unloaded code • Ephemeral/adaptive instrumentation • Self-modifying code • Bounded code caches • [Figure: EXE → Transform → Profile → Code Cache → Execute pipeline]

  12. Motivating a Bounded Code Cache • The Perl Benchmark

  13. Flushing the Code Cache • Option 1: All threads have a private code cache (oops, doesn’t scale) • Option 2: Shared code cache across threads • If one thread flushes the code cache, other threads may resume in stale memory

  14. Naïve Flush • Wait for all threads to return from the code cache before flushing • Could wait indefinitely! • [Figure: timeline of Thread1–Thread3 alternating between the VM and code caches CC1/CC2; threads stall during the flush delay]

  15. Generational Flush • Allow threads to continue to make progress in a separate area of the code cache • Requires a high-water mark • [Figure: timeline of Thread1–Thread3 moving from the VM into CC1 and on to CC2 without stalling]
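The timeline is essentially bookkeeping: the flushing thread publishes a fresh cache "generation" immediately, and the old generation is reclaimed only after every thread has drained out of it (the high-water mark decides when to trigger the flush). The sketch below is our own illustration of that idea in portable C++, not Pin's implementation; enter(), flush(), and reclaim() are assumed to be invoked from the serialized VM, so a retired region is never deleted while a thread is entering it.

```cpp
// Illustrative sketch of a generational code-cache flush (not Pin's code).
// Assumptions: enter(), flush(), and reclaim() run under the serialized VM;
// leave() only touches an atomic counter, so it may run from any thread.
#include <atomic>
#include <vector>

struct CacheRegion {
    int generation = 0;
    std::atomic<int> residents{0};   // threads currently executing in this region
    // ... storage for translated traces would live here ...
};

class CodeCache {
    std::atomic<CacheRegion*> active_;
    std::vector<CacheRegion*> retired_;
public:
    CodeCache() : active_(new CacheRegion()) {}

    // A thread calls enter() when it leaves the VM for the code cache.
    CacheRegion* enter() {
        CacheRegion* r = active_.load(std::memory_order_acquire);
        r->residents.fetch_add(1, std::memory_order_acq_rel);
        return r;
    }

    // ...and leave() when it returns to the VM.
    void leave(CacheRegion* r) {
        r->residents.fetch_sub(1, std::memory_order_acq_rel);
    }

    // Generational flush: publish a fresh region right away so every thread
    // keeps making progress; the stale region is reclaimed only once it has
    // drained. A naive flush would instead stall here until residents == 0.
    void flush() {
        CacheRegion* old = active_.load(std::memory_order_acquire);
        CacheRegion* fresh = new CacheRegion();
        fresh->generation = old->generation + 1;
        active_.store(fresh, std::memory_order_release);
        retired_.push_back(old);
    }

    // Called periodically from the serialized VM to free drained generations.
    void reclaim() {
        for (size_t i = 0; i < retired_.size();) {
            if (retired_[i]->residents.load(std::memory_order_acquire) == 0) {
                delete retired_[i];
                retired_[i] = retired_.back();
                retired_.pop_back();
            } else {
                ++i;
            }
        }
    }
};
```

A thread that was already executing in the old generation simply finishes there; only the memory cost of keeping one extra generation alive is paid, which is why the scheme needs a high-water mark rather than a completely full cache to trigger the flush.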

  16. Memory Scalability of the Code Cache • Ensuring scalability also requires carefully configuring the code stored in the cache • Trace lengths: the first basic block of a trace is non-speculative, the others are speculative • Longer traces = fewer entries in the lookup table, but more unexecuted code • Shorter traces = two off-trace paths at the end of each basic block with a conditional branch = more exit stub code

  17. Effect of Trace Length on Trace Count

  18. Effect of Trace Length on Memory

  19. Rewriting Instructions • Pin must regularly rewrite branches • No atomic branch write on x86 • We use a neat trick*: overwrite the start of the “old” 5-byte branch with a 2-byte self branch, write the remaining n-2 bytes of the “new” branch, then replace the self branch with the first 2 bytes of the “new” 5-byte branch (* Sundaresan et al. 2006)
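To make the trick concrete, the sequence below sketches how a 5-byte branch can be repatched in place while other threads may be executing it. This is our own illustration of the Sundaresan et al. scheme, not Pin's source: it assumes the code page has been made writable and that the first two bytes of the site do not straddle a cache line, so the 16-bit stores are atomic on x86 (real code would use aligned/locked stores or inline assembly rather than casting raw code bytes to std::atomic).

```cpp
// Illustrative sketch of the self-branch patching trick (Sundaresan et al. 2006).
// Assumes: the code page is writable, and site's first two bytes do not cross
// a cache-line boundary, so the 16-bit stores are atomic on x86.
#include <atomic>
#include <cstdint>
#include <cstring>

void patch_branch(uint8_t* site, const uint8_t new_branch[5]) {
    auto* head16 = reinterpret_cast<std::atomic<uint16_t>*>(site);

    // 1. Atomically overwrite the first two bytes of the "old" branch with a
    //    2-byte self branch (jmp -2, encoded EB FE). Any thread reaching the
    //    site now spins harmlessly in place instead of seeing a torn branch.
    head16->store(0xFEEB, std::memory_order_release);   // little-endian EB FE

    // 2. Write the remaining n-2 bytes of the "new" 5-byte branch. No thread
    //    can execute these bytes yet, because the self branch blocks entry.
    std::memcpy(site + 2, new_branch + 2, 3);
    std::atomic_thread_fence(std::memory_order_release);

    // 3. Atomically replace the self branch with the first two bytes of the
    //    "new" branch, releasing any spinning threads onto the new target.
    uint16_t head;
    std::memcpy(&head, new_branch, 2);
    head16->store(head, std::memory_order_release);
}
```

The key property is that at every point in time the first two bytes of the site decode to something safe: either the old branch, the self branch, or the new branch, never a half-written mixture.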

  20. Performance Results • We use the SPEC OMP 2001 benchmarks, controlling thread count with the OMP_NUM_THREADS environment variable • We compare: • Native performance and scalability • Pin (no Pintool) performance and scalability • Pin (lightweight Pintool) scalability – the InsCount Pintool counts instructions at BB granularity • Pin (middleweight Pintool) scalability – the MemTrace Pintool records memory addresses (see the sketch below) • Pin (heavyweight Pintool) scalability – CMP$im collects memory addresses and applies a software model of the CMP cache
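The paper's MemTrace and CMP$im tools are not reproduced here, but the middleweight category (recording the effective address of every memory operand) looks roughly like the pinatrace example in the Pin kit. The sketch below is an illustration under that assumption; note that the single shared FILE* serializes the threads, which is exactly the kind of bottleneck a scalable multithreaded tool must avoid, e.g., with per-thread trace buffers.

```cpp
// Sketch of a memory-address-recording Pintool, loosely after Pin's pinatrace
// example (illustrative; not the MemTrace tool measured in the paper).
#include "pin.H"
#include <cstdio>

static FILE* trace;

// Analysis routines: record the instruction pointer and effective address.
// The shared FILE* is a deliberate simplification: fprintf serializes the
// threads; a scalable multithreaded tool would buffer per thread.
VOID RecordMemRead(VOID *ip, VOID *addr)  { fprintf(trace, "%p: R %p\n", ip, addr); }
VOID RecordMemWrite(VOID *ip, VOID *addr) { fprintf(trace, "%p: W %p\n", ip, addr); }

// Instrumentation routine: instrument every memory operand of every instruction.
VOID Instruction(INS ins, VOID *v) {
    UINT32 memOperands = INS_MemoryOperandCount(ins);
    for (UINT32 memOp = 0; memOp < memOperands; memOp++) {
        if (INS_MemoryOperandIsRead(ins, memOp)) {
            INS_InsertPredicatedCall(ins, IPOINT_BEFORE, (AFUNPTR)RecordMemRead,
                                     IARG_INST_PTR, IARG_MEMORYOP_EA, memOp,
                                     IARG_END);
        }
        if (INS_MemoryOperandIsWritten(ins, memOp)) {
            INS_InsertPredicatedCall(ins, IPOINT_BEFORE, (AFUNPTR)RecordMemWrite,
                                     IARG_INST_PTR, IARG_MEMORYOP_EA, memOp,
                                     IARG_END);
        }
    }
}

VOID Fini(INT32 code, VOID *v) { fclose(trace); }

int main(int argc, char *argv[]) {
    PIN_Init(argc, argv);
    trace = fopen("memtrace.out", "w");
    INS_AddInstrumentFunction(Instruction, 0);
    PIN_AddFiniFunction(Fini, 0);
    PIN_StartProgram();
    return 0;
}
```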

  21. Native Scalability of SPEC OMP 2001

  22. Performance Scalability (No Instrumentation)

  23. Performance Scalability (LightWeight Instrumentation)

  24. Performance Scalability (MiddleWeight Instrumentation)

  25. Performance Scalability (HeavyWeight Instrumentation)

  26. Memory Scalability

  27. Summary • Dynamic instrumentation tools are useful • In the multicore era, we must provide support for MT application analysis and simulation • Providing MT support in Pin was easy • Making it robust and scalable was not easy • http://www.pintool.org
