
Making the “Box” Transparent: System Call Performance as a First-class Result

Presentation Transcript


  1. Making the “Box” Transparent: System Call Performance as a First-class Result. Yaoping Ruan, Vivek Pai, Princeton University

  2. Outline • Motivation • Design & implementation • Case study • More results

  3. Motivation: Beginning of the Story • The Flash Web Server yields a very low SPECWeb99 score: 200 • 220: standard Apache • 400: customized Apache • 575: best score (Tux) on comparable hardware • However, the Flash Web Server is known to be fast

  4. Motivation: Performance Debugging of OS-Intensive Applications • Web servers (+ full SPECWeb99): high throughput (CPU/network), large working sets (disk activity), dynamic content (multiple programs), QoS requirements (latency-sensitive), workload scaling (overhead-sensitive) • How do we debug such a system?

  5. Motivation: Current Tradeoffs • Statistical sampling (DCPI, VTune, OProfile): fast, but lacks completeness • Call graph profiling (gprof): detailed, but high overhead (> 40%) • Measurement calls (getrusage(), gettimeofday()): online, but guesswork and inaccuracy

  6. Motivation: Example of Current Tradeoffs • gettimeofday() reports only the wall-clock time of a non-blocking system call, as opposed to true in-kernel measurement

  7. Design: Design Goals • Correlate kernel information with application-level information • Low overhead on “useless” data • High detail on useful data • Allow the application to control profiling and programmatically react to the information

  8. Design DeBox: Splitting Measurement Policy & Mechanism • Add profiling primitives into kernel • Return feedback with each syscall • First-class result, like errno • Allow programs to interactively profile • Profile by call site and invocation • Pick which processes to profile • Selectively store/discard/utilize data

  9. Design: DeBox Architecture • Application side: … res = read(fd, buffer, count) …, with the DeBox info returned into the application's buffer alongside the usual result and errno, to be used online or stored for offline analysis • Kernel side: read() { start profiling; … end profiling; return; }

  10. Implementation: DeBox Data Structure • Basic DeBox Info: performance summary • Sleep Info [0], [1], …: blocking information for the call • Trace Info [0], [1], [2], …: full call trace & timing • Enabled with DeBoxControl(DeBoxInfo *resultBuf, int maxSleeps, int maxTrace) (usage sketch below)
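
A minimal usage sketch of the DeBoxControl() interface above, assuming a DeBox-patched FreeBSD 4.6.2 kernel and its headers. Only the DeBoxControl(resultBuf, maxSleeps, maxTrace) signature comes from the slide; the DeBoxInfo field names (NumPerSleepInfo, CallTime) and the DeBoxControl(NULL, 0, 0) call to stop profiling are assumptions for illustration.

  /* Hypothetical sketch: DeBoxInfo comes from the (assumed) DeBox headers;
     the field names used here are illustrative, not confirmed. */
  #include <stdio.h>
  #include <unistd.h>

  #define MAX_SLEEPS 4   /* PerSleepInfo slots to fill per call */
  #define MAX_TRACE  0   /* 0 = no full call trace, keeps overhead low */

  int main(void)
  {
      DeBoxInfo info;    /* per-syscall result buffer, filled by the kernel */
      char buf[4096];

      /* Register the buffer: each later syscall from this process returns
         its profiling data here, as a first-class result next to errno. */
      DeBoxControl(&info, MAX_SLEEPS, MAX_TRACE);

      ssize_t n = read(STDIN_FILENO, buf, sizeof(buf));

      /* React programmatically, e.g. log only the calls that blocked. */
      if ((int)info.NumPerSleepInfo > 0)
          fprintf(stderr, "read() of %zd bytes blocked %d time(s), %lu usec\n",
                  n, (int)info.NumPerSleepInfo, (unsigned long)info.CallTime);

      DeBoxControl(NULL, 0, 0);   /* assumed way to turn profiling back off */
      return 0;
  }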

  11. Implementation: Sample Output (copying a 10MB mapped file)
  Basic DeBox Info: 4 system call # (write()); 3591064 call time (usec); 989 page faults; 2 PerSleepInfo entries used; 0 CallTrace entries used
  PerSleepInfo[0]: 1270 occurrences; 723903 usec blocked; resource label biowr; blocked at kern/vfs_bio.c line 2727; 1 process on entry; 0 processes on exit
  CallTrace (depth, call time in usec): 0 write() 3591064; 1 holdfp() 25; 2 fhold() 5; 1 dofilewrite() 3590326; 2 fo_write() 3586472; …

  12. Implementation In-kernel Implementation • 600 lines of code in FreeBSD 4.6.2 • Monitor trap handler • System call entry, exit, page fault, etc. • Instrument scheduler • Time, location, and reason for blocking • Identify dynamic resource contention • Full call path tracing + timings • More detail than gprof’s call arc counts
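
The following is a user-space model, not the actual FreeBSD 4.6.2 patch, of where those hooks sit: a timestamp at system call entry in the trap handler, a record of the blocking site when the scheduler puts the process to sleep, and an elapsed-time computation at exit. The structure layouts and function names are assumptions for illustration; the real code also copies the results out to the buffer registered via DeBoxControl().

  /* Model of the three DeBox measurement points (illustrative only). */
  #include <sys/time.h>

  #define DEBOX_MAX_SLEEPS 4

  struct per_sleep_info {          /* assumed layout, for illustration */
      long        time_blocked_usec;
      const char *resource;        /* e.g. "biowr", "inode" */
      const char *file;            /* kernel file where the call blocked */
      int         line;
  };

  struct debox_state {
      struct timeval        start; /* set at system call entry */
      int                   num_sleeps;
      struct per_sleep_info sleeps[DEBOX_MAX_SLEEPS];
  };

  /* Hook in the trap handler, at system call entry. */
  void debox_syscall_enter(struct debox_state *ds)
  {
      gettimeofday(&ds->start, NULL);  /* the kernel would use microtime() */
      ds->num_sleeps = 0;
  }

  /* Hook in the scheduler, where the process blocks. */
  void debox_record_sleep(struct debox_state *ds, const char *resource,
                          const char *file, int line, long blocked_usec)
  {
      if (ds->num_sleeps < DEBOX_MAX_SLEEPS) {
          struct per_sleep_info *s = &ds->sleeps[ds->num_sleeps++];
          s->resource = resource;
          s->file = file;
          s->line = line;
          s->time_blocked_usec = blocked_usec;
      }
  }

  /* Hook at system call exit: elapsed wall-clock time for the whole call. */
  long debox_syscall_exit(const struct debox_state *ds)
  {
      struct timeval now;
      gettimeofday(&now, NULL);
      return (now.tv_sec - ds->start.tv_sec) * 1000000L +
             (now.tv_usec - ds->start.tv_usec);
  }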

  13. Implementation General DeBox Overheads

  14. Case Study: Using DeBox on the Flash Web Server • [Chart: SPECWeb99 results, score 0 to 900, vs. dataset size 0.12 to 2.64 GB, for orig, VM patch, and sendfile configurations] • Experimental setup • Mostly 933MHz PIII • 1GB memory • Netgear GA621 gigabit NIC • Ten PII 300MHz clients

  15. Case Study: Example 1: Finding Blocking System Calls • [Chart: blocking system calls, the resources blocked on, and their counts: biord - 162, inode - 84, inode - 28, inode - 9, inode - 6, biord - 3, biord - 1]

  16. Case Study: Example 1 Results & Solution • Mostly metadata locking • Direct all metadata calls to name-convert helpers • Pass open FDs using sendmsg() (see the sketch below) • [Chart: SPECWeb99 results vs. dataset size 0.12 to 2.64 GB: orig, VM patch, sendfile, FD passing]
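
A minimal sketch of the FD-passing fix: a name-conversion helper opens the file and hands the open descriptor to the server over a Unix-domain socket with sendmsg() and SCM_RIGHTS ancillary data. The function name and the single dummy payload byte are illustrative; socket setup and error handling are omitted.

  /* Send an already-open descriptor over a connected AF_UNIX socket. */
  #include <string.h>
  #include <sys/socket.h>
  #include <sys/uio.h>

  int send_open_fd(int sock, int fd)
  {
      char dummy = 'F';            /* at least one byte of real data is required */
      struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
      union {
          struct cmsghdr hdr;
          char           buf[CMSG_SPACE(sizeof(int))];
      } cmsgbuf;
      struct msghdr msg;

      memset(&msg, 0, sizeof(msg));
      msg.msg_iov        = &iov;
      msg.msg_iovlen     = 1;
      msg.msg_control    = cmsgbuf.buf;
      msg.msg_controllen = sizeof(cmsgbuf.buf);

      struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
      cmsg->cmsg_level = SOL_SOCKET;
      cmsg->cmsg_type  = SCM_RIGHTS;          /* carry the descriptor itself */
      cmsg->cmsg_len   = CMSG_LEN(sizeof(int));
      memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

      return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
  }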

  17. Case Study: Example 2: Capturing Rare Anomaly Paths • Several call sites can reach the same system call: FunctionX() { … open(); … } and FunctionY() { open(); … open(); } • When blocking happens: abort() (or fork + abort) and record the call path • Shows which call caused the problem and why that path was executed (application + kernel call trace)
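
A minimal sketch of the abort-on-anomaly idea, assuming a DeBox-patched kernel: after a call that should never block, check the returned DeBox info and, if it did block, fork and abort so the child's core dump records the exact application call path while the server keeps running. The DeBoxInfo field name NumPerSleepInfo and the wrapper function are assumptions for illustration.

  /* Hypothetical wrapper: capture the call path of a rare blocking open(). */
  #include <fcntl.h>
  #include <stdlib.h>
  #include <unistd.h>

  extern DeBoxInfo debox_info;   /* buffer registered earlier via DeBoxControl() */

  int open_should_not_block(const char *path)
  {
      int fd = open(path, O_RDONLY);

      if ((int)debox_info.NumPerSleepInfo > 0) {  /* the call blocked: anomaly */
          if (fork() == 0)
              abort();           /* child dumps core with the offending call path */
      }
      return fd;
  }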

  18. Case Study: Example 2 Results & Solution • Cold cache-miss path • Allow read helpers to open cache-missed files • More benefit in latency • [Chart: SPECWeb99 results vs. dataset size 0.12 to 2.64 GB: orig, VM patch, sendfile, FD passing]

  19. Case Study: Example 3: Tracking Call History & Call Trace • [Chart: call time of fork() as a function of invocation number] • Call trace indicates: file descriptor copy - fd_copy(), VM map entry copy - vm_copy()

  20. Case Study: Example 3 Results & Solution • Similar phenomenon on mmap() • Fork helper (sketch below) • Eliminate mmap() • [Chart: SPECWeb99 results vs. dataset size 0.12 to 2.64 GB: orig, VM patch, sendfile, FD passing, fork helper, no mmap()]
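
A minimal sketch of the fork-helper idea under assumptions not in the slides: a tiny helper process is created at startup, before the server's descriptor table and address space grow, and the server later asks it over a pipe to fork and exec CGI programs, so fork() never has to copy the large server state. The request format (a bare program path) and the CGI path are hypothetical; parsing and error handling are omitted.

  /* Illustrative fork helper: the big server process never calls fork() itself. */
  #include <sys/wait.h>
  #include <unistd.h>

  /* Helper loop: runs in the small process created at server startup. */
  void helper_loop(int req_fd)
  {
      char    path[1024];
      ssize_t n;

      while ((n = read(req_fd, path, sizeof(path) - 1)) > 0) {
          path[n] = '\0';
          if (fork() == 0) {              /* cheap: tiny fd table and VM map */
              execl(path, path, (char *)NULL);
              _exit(127);
          }
          while (waitpid(-1, NULL, WNOHANG) > 0)
              ;                           /* reap any finished children */
      }
  }

  int main(void)
  {
      int pipefd[2];
      pipe(pipefd);

      if (fork() == 0) {                  /* the only fork the server ever does */
          close(pipefd[1]);
          helper_loop(pipefd[0]);
          _exit(0);
      }
      close(pipefd[0]);

      /* ... server initialization grows the process from here on ... */
      write(pipefd[1], "/usr/local/www/cgi-bin/test-cgi", 31);  /* hypothetical */
      return 0;
  }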

  21. Case Study: Sendfile Modifications (Now Adopted by FreeBSD) • Cache pmap/sfbuf entries • Return a special error for cache misses • Pack header + data into one packet (sketch below)
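
A minimal sketch of the "header + data in one packet" change using FreeBSD's sendfile() and its sf_hdtr argument, which lets the HTTP header and the file body go out in a single call. The descriptor names and header text are illustrative; the SF_NODISKIO flag mentioned in the comment is, to the best of my knowledge, how later FreeBSD releases expose the "special error on cache miss" behavior (sendfile() fails with EBUSY instead of blocking), but treat that as an assumption.

  /* FreeBSD-specific sketch: one sendfile() call for header plus body. */
  #include <sys/types.h>
  #include <sys/socket.h>
  #include <sys/uio.h>
  #include <errno.h>
  #include <string.h>

  ssize_t send_response(int sock, int file_fd, off_t file_size)
  {
      char hdr[] = "HTTP/1.0 200 OK\r\nContent-Type: text/html\r\n\r\n";
      struct iovec   hvec = { .iov_base = hdr, .iov_len = strlen(hdr) };
      struct sf_hdtr hdtr = { .headers = &hvec, .hdr_cnt = 1,
                              .trailers = NULL, .trl_cnt = 0 };
      off_t sent = 0;

      /* With the (assumed) SF_NODISKIO flag, a cache miss would return EBUSY
         here instead of blocking, so the request can go to a read helper. */
      if (sendfile(file_fd, sock, 0, (size_t)file_size, &hdtr, &sent, 0) < 0) {
          if (errno == EBUSY)
              return -2;     /* hand off to a helper that can block on disk */
          return -1;
      }
      return (ssize_t)sent;  /* bytes sent, including the header */
  }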

  22. Case Study: SPECWeb99 Scores • [Chart: SPECWeb99 score, 0 to 900, vs. dataset size 0.12 to 2.95 GB, for orig, VM patch, sendfile, FD passing, fork helper, no mmap(), new CGI interface, and new sendfile()]

  23. Case Study New Flash Summary • Application-level changes • FD passing helpers • Move fork into helper process • Eliminate mmap cache • New CGI interface • Kernel sendfile changes • Reduce pmap/TLB operation • New flag to return if data missing • Send fewer network packets for small files

  24. Other Results: Throughput on SPECWeb Static Workload • [Chart: throughput (Mb/s, 0 to 700) for Apache, Flash, and New-Flash at 500 MB, 1.5 GB, and 3.3 GB dataset sizes]

  25. Other Results: Latencies on 3.3GB Static Workload • [Chart: latency distributions, with annotated improvements of 44X and 47X] • Mean latency improvement: 12X

  26. Other Results: Throughput Portability (on Linux) • [Chart: throughput (Mb/s, 0 to 1400) for Haboob, Apache, Flash, and New-Flash at 500 MB, 1.5 GB, and 3.3 GB dataset sizes] • (Server: 3.0GHz P4, 1GB memory)

  27. Other Results: Latency Portability (3.3GB Static Workload) • [Chart: latency distributions, with annotated improvements of 112X and 3.7X] • Mean latency improvement: 3.6X

  28. Summary • DeBox is effective on OS-intensive applications and complex workloads • Low overhead on real workloads • Fine detail on real bottlenecks • Flexibility for application programmers • Case study: SPECWeb99 score quadrupled, even with a dataset 3x the size of physical memory • Up to 36% throughput gain on the static workload • Up to 112x latency improvement • Results are portable

  29. Thank you • www.cs.princeton.edu/~yruan/debox • yruan@cs.princeton.edu • vivek@cs.princeton.edu

  30. SpecWeb99 Scores • Standard Flash 200 • Standard Apache 220 • Apache + special module 420 • Highest 1GB/1GHz score 575 • Improved Flash 820 • Flash + dynamic request module 1050

  31. SpecWeb99 on Linux • Standard Flash 600 • Improved Flash 1000 • Flash + dynamic request module 1350 • 3.0GHz P4 with 1GB memory

  32. New Flash Architecture
