1 / 25

Automatic Application Profiling

Automatic Application Profiling. Lecture 22. Today – What parts of the code are slow?. Amdahl’s law How to get the processor to tell us what’s taking the most time – Statistical program counter sampling. Amdahl’s Law. Gene Amdahl - “Optimize the common case”

Download Presentation

Automatic Application Profiling

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Automatic Application Profiling Lecture 22

  2. Today – What parts of the code are slow? • Amdahl’s law • How to get the processor to tell us what’s taking the most time – Statistical program counter sampling

  3. Amdahl’s Law • Gene Amdahl - “Optimize the common case” • Double the speed of ¼ of the program: • Quadruple the speed of ¼ of the program:

  4. Work on the Slow Part • Double the speed of ¾ of the program: • Quadruple the speed of ¾ of the program:

  5. How Do We Find the Slow Parts • Option A: Measure the amount of time each region takes to execute • Code unsigned long times[NUM_FUNCTIONS]; void my_function(int i) { t_start = get_ticks(); /* function body */ times[THIS_FUNCTION_NUM] += get_ticks() - t_start; } • Pros • Exact. Can get single cycle accuracy if needed. • Cons • Tedious. Must add code to each function to be monitored. • Need access to source code, which may be a problem for library functions.

  6. Program Counter Sampling • Option B: Periodically examine the PC to see what’s running • Result shows average fraction of time spent executing a region of code • Supporting data structure: table of region information • Starting and ending addresses: defined before sampling • Execution counts:updated during sampling • Sampling • Use a timer to interrupt application periodically • Within ISR • Read PC off of stack • Examine table of region addresses to determine currently executing region N • Increment entry N of execution count table • Also increment a total number of ticks variable (to reveal out-of-range PC instances) • At end • Provide execution count table to user (via file, serial port, debugger, etc.) typedef struct { char Name[PROFILE_NAME_SIZE]; unsigned long Start, End; unsigned Count; } PROFILE_T;

  7. Configure Timer to Generate Periodic Interrupt • Call this function with desired sampling frequency in Hz void Init_Profiling(unsigned samp_freq) { unsigned long divider; // set up timer A0 to interrupt at samp_freq ta0mr = 0x00; divider = ((unsigned long)MAIN_CLOCK)/samp_freq; if (divider > 0x0ffffl) ta0 = 0x0ffff; else ta0 = (unsigned) (divider & 0x0ffff); DISABLE_IRQ; ta0ic = 1; ENABLE_IRQ; ta0s = 1; }

  8. Profiler Interrupt Service Routine #pragma INTERRUPT/B profile_intr void profile_intr(void) { unsigned char PC_H; unsigned int PC_ML; unsigned long PC; unsigned char i; /* Get PC from stack */ _asm("mov.w 2[FB], $$[FB]", PC_ML); _asm("mov.b 5[FB], $$[FB]", PC_H); PC = PC_H; PC <<= 16; PC += PC_ML; profile_ticks++; /* look up function in table and inc. counter */ for (i=0; i<NUM_PROFILE_REGIONS; i++) { if ((PC >= profiles[i].Start) && (PC <= profiles[i].End)) { profiles[i].Count++; return; } } } • Don’t forget to register ISR in vector table!

  9. Configure Profile Table with Region Information To profile modules (source files) • Where do we find region addresses? Next page. PROFILE_T profiles[NUM_PROFILE_REGIONS] = { {"main.c", UL 0x0f0100, UL 0x0f0299, 0}, {"profile.c", UL 0x0f029a, UL 0x0f037f, 0}, {"skp26.c", UL 0xf0380, UL 0xf0613, 0}, {"skp_lcd.c", UL 0x0f0614, UL 0x0f0917, 0}, {"library", UL 0x0f0918, UL 0x0f290d, 0}, {"", UL 0, UL 0, 0}, {"", UL 0, UL 0, 0}, {"", UL 0, UL 0, 0}, }; To profile functions PROFILE_T profiles[NUM_PROFILE_REGIONS] = { {"main", UL main, UL (main+0x0199), 0}, {"LCD_Erase", UL LCD_Erase_FB, UL (LCD_Erase_FB+0x0240), 0}, {"sin", UL sin, UL (sin+0x0138), 0}, {"LCD_Plot_in_FB", UL LCD_Plot_in_FB, UL (LCD_Plot_in_FB+0x02e3), 0}, {"LCD_Display_FB", UL LCD_Display_FB, UL (LCD_Display_FB+0x0302), 0}, {"DisplayDelay", UL DisplayDelay, UL (DisplayDelay+0x01ad), 0}, {"LCD_write", UL LCD_write, UL (LCD_write+0x017f), 0}, {"", UL 0, UL 0, 0} };

  10. Finding the Region Addresses • Get module addresses from linker’s map file (in debug directory) • Get function lengths (if needed) from .LST file • Second column is start address of each assembly instruction • Subtract function’s first address from its last address to find length program REL CODE 0F00FF 000000 NCRT0_26SKP REL CODE 0F0100 00019A 2 MAIN REL CODE 0F029A 0000E5 2 PROFILE REL CODE 0F0380 000293 2 SKP26 REL CODE 0F0614 000303 2 SKP_LCD REL CODE 0F0918 00020F 2 _F4DIV REL CODE 0F0B28 00008D 2 _F4TOF8 REL CODE 0F0BB6 000A65 2 _F8ADD 506 ;## # FUNCTION LCD_Erase_FB 507 ;## # FRAME AUTO (y) size 2, offset -4 508 ;## # FRAME AUTO (x) size 2, offset -2 509 ;## # ARG Size(0) Auto Size(4) Context Size(5) 510 511 .align 512 ;## # C_SRC : void LCD_Erase_FB(void) { 513 .glb _LCD_Erase_FB 514 00212 _LCD_Erase_FB: 515 00212 7CF204 enter #04H 516 ;## # C_SRC : for (x=0; x<8; x++) 517 00215 D90BFE mov.w #0000H,-2[FB] ; x P (etc.) 539 ;## # C_SRC : } 540 0023F 7DF2 exitd 541 00241 E7:

  11. Overview of Profiling Approach • Start with the big picture (rough details) and use that to determine where to look next • Profiling sequence • module-level (file-level) • function-level within the most common module • basic block-level within the most common function

  12. Detailed Steps to using profile.c/h • Enable list file creation for each C source file • HEW: Options -> Renesas M16C Standard Toolchain • C Tab -> Category: List. Check –dS and –dSL boxes • Fill in array profiles with region addresses (e.g. names of functions), dummy lengths, and zero counts • Compile • May need to add function prototypes if profiles table is declared before the functions are • Update profiles array with correct starting and ending addresses • Recompile • Run • Examine profiles after running long enough

  13. How Long is Enough? • Complex statistical question • The statistician I asked said “it depends” and changed the subject • So, run it until the digits you care about stop changing • Example: Module-level profiling

  14. Where does the Lab 4 Skeleton Spend its Time? DisplayString(LCD_LINE1," Lab #4 "); DisplayString(LCD_LINE2," Starter"); GRN_LED = LED_ON; while (1) { for (f=6.0; f>0.0; f -= 0.4) { LCD_Erase_FB(); for (i=0; i<DISP_WIDTH_PIXELS; i++) LCD_Plot_in_FB((unsigned char)i, (unsigned char) (3.5*(sin(i/f)+1.0)), 1); LCD_Display_FB(LCD_LINE1); } } We know there are delay loops executed every time the MCU writes to the LCD, but let’s verify how bad they are • Start with modules • Then look at functions in module • Then look at basic blocks within function

  15. Step One: Profile Modules # SECTION ATR TYPE START LENGTH ALIGN MODULENAME program REL CODE 0F00FF 000000 NCRT0_26SKP REL CODE 0F0100 00019A 2 MAIN REL CODE 0F029A 0000E5 2 PROFILE REL CODE 0F0380 000293 2 SKP26 REL CODE 0F0614 000303 2 SKP_LCD REL CODE 0F0918 00020F 2 _F4DIV REL CODE 0F0B28 00008D 2 _F4TOF8 REL CODE 0F0BB6 000A65 2 _F8ADD REL CODE 0F161C 0000A7 2 _F8LE REL CODE 0F16C4 000069 2 _F8LTOR REL CODE 0F172E 0002DE 2 _F8MUL REL CODE 0F1A0C 0000BA 2 _F8TOF4 REL CODE 0F1AC6 000025 2 _F8TOI4U REL CODE 0F1AEC 000192 2 _FTOL REL CODE 0F1C7E 00004D 2 _I4DIVU REL CODE 0F1CCC 000022 2 _I4TOF4 REL CODE 0F1CEE 0000FD 2 _LTOF REL CODE 0F1DEC 000138 2 SIN REL CODE 0F1F24 00035D 2 TAN REL CODE 0F2282 000058 2 _F4LTOR REL CODE 0F22DA 000060 2 _F4RTOL REL CODE 0F233A 0002E4 2 _F8DIV REL CODE 0F261E 00007C 2 _F8EQ REL CODE 0F269A 0000A7 2 _F8LT REL CODE 0F2742 00007C 2 _F8NE REL CODE 0F27BE 000066 2 _F8RTOL REL CODE 0F2824 00002E 2 _F8SUB REL CODE 0F2852 000025 2 _F8TOI4 REL CODE 0F2878 000073 2 _I4MOD REL CODE 0F28EC 000022 2 _I4TOF8 • Define a profile region per module, and one for all the library functions

  16. Step One: Profile Modules PROFILE_T profiles[NUM_PROFILE_REGIONS] = { {"main.c", UL 0x0f0100, UL 0x0f0299, 0}, {"profile.c", UL 0x0f029a, UL 0x0f037f, 0}, {"skp26.c", UL 0x0f0380, UL 0x0f0613, 0}, {"skp_lcd.c", UL 0x0f0614, UL 0x0f0917, 0}, {"library", UL 0x0f0918, UL 0x0f290d, 0}, {"", UL 0, UL 0, 0}, {"", UL 0, UL 0, 0}, {"", UL 0, UL 0, 0}, };

  17. Step 1 Results • Surprise! The LCD functions aren’t taking up most of the processor’s time! The library functions are instead

  18. Step Two: Profile Library # SECTION ATR TYPE START LENGTH ALIGN MODULENAME program REL CODE 0F00FF 000000 NCRT0_26SKP REL CODE 0F0100 00019A 2 MAIN REL CODE 0F029A 0000E5 2 PROFILE REL CODE 0F0380 000293 2 SKP26 REL CODE 0F0614 000303 2 SKP_LCD REL CODE 0F0918 00020F 2 _F4DIV REL CODE 0F0B28 00008D 2 _F4TOF8 REL CODE 0F0BB6 000A65 2 _F8ADD REL CODE 0F161C 0000A7 2 _F8LE REL CODE 0F16C4 000069 2 _F8LTOR REL CODE 0F172E 0002DE 2 _F8MUL REL CODE 0F1A0C 0000BA 2 _F8TOF4 REL CODE 0F1AC6 000025 2 _F8TOI4U REL CODE 0F1AEC 000192 2 _FTOL REL CODE 0F1C7E 00004D 2 _I4DIVU REL CODE 0F1CCC 000022 2 _I4TOF4 REL CODE 0F1CEE 0000FD 2 _LTOF REL CODE 0F1DEC 000138 2 SIN REL CODE 0F1F24 00035D 2 TAN REL CODE 0F2282 000058 2 _F4LTOR REL CODE 0F22DA 000060 2 _F4RTOL REL CODE 0F233A 0002E4 2 _F8DIV REL CODE 0F261E 00007C 2 _F8EQ REL CODE 0F269A 0000A7 2 _F8LT REL CODE 0F2742 00007C 2 _F8NE REL CODE 0F27BE 000066 2 _F8RTOL REL CODE 0F2824 00002E 2 _F8SUB REL CODE 0F2852 000025 2 _F8TOI4 REL CODE 0F2878 000073 2 _I4MOD REL CODE 0F28EC 000022 2 _I4TOF8 • We only have eight entries in our table, so let’s split up the library into eight regions of about three functions each lib 1 lib 2 lib 3 lib 4 lib 5 lib 6 lib 7 lib 8

  19. Step Two: Profile Library PROFILE_T profiles[NUM_PROFILE_REGIONS] = { {"lib 1", UL 0x0f0918, UL 0x0f161b, 0}, {"lib 2", UL 0x0f161c, UL 0x0f1a0b, 0}, {"lib 3", UL 0x0f1a0c, UL 0x0f1c7d, 0}, {"lib 4", UL 0x0f1c7e, UL 0x0f1deb, 0}, {"lib 5", UL 0x0f1dec, UL 0x0f22d9, 0}, {"lib 6", UL 0x0f22da, UL 0x0f2699, 0}, {"lib 7", UL 0x0f269a, UL 0x0f2823, 0}, {"lib 8", UL 0x0f2824, UL 0x0f290d, 0} };

  20. Step Two Results • Functions in group lib 6 are taking the most time, followed by lib 4 and lib 1

  21. Step Three: Profile Top Library Functions # SECTION ATR TYPE START LENGTH ALIGN MODULENAME program REL CODE 0F00FF 000000 NCRT0_26SKP REL CODE 0F0100 00019A 2 MAIN REL CODE 0F029A 0000E5 2 PROFILE REL CODE 0F0380 000293 2 SKP26 REL CODE 0F0614 000303 2 SKP_LCD REL CODE 0F0918 00020F 2 _F4DIV REL CODE 0F0B28 00008D 2 _F4TOF8 REL CODE 0F0BB6 000A65 2 _F8ADD REL CODE 0F161C 0000A7 2 _F8LE REL CODE 0F16C4 000069 2 _F8LTOR REL CODE 0F172E 0002DE 2 _F8MUL REL CODE 0F1A0C 0000BA 2 _F8TOF4 REL CODE 0F1AC6 000025 2 _F8TOI4U REL CODE 0F1AEC 000192 2 _FTOL REL CODE 0F1C7E 00004D 2 _I4DIVU REL CODE 0F1CCC 000022 2 _I4TOF4 REL CODE 0F1CEE 0000FD 2 _LTOF REL CODE 0F1DEC 000138 2 SIN REL CODE 0F1F24 00035D 2 TAN REL CODE 0F2282 000058 2 _F4LTOR REL CODE 0F22DA 000060 2 _F4RTOL REL CODE 0F233A 0002E4 2 _F8DIV REL CODE 0F261E 00007C 2 _F8EQ REL CODE 0F269A 0000A7 2 _F8LT REL CODE 0F2742 00007C 2 _F8NE REL CODE 0F27BE 000066 2 _F8RTOL REL CODE 0F2824 00002E 2 _F8SUB REL CODE 0F2852 000025 2 _F8TOI4 REL CODE 0F2878 000073 2 _I4MOD REL CODE 0F28EC 000022 2 _I4TOF8 • Examine the nine functions in these three groups, grouping two functions together

  22. Step Three: Profile Top Library Functions PROFILE_T profiles[NUM_PROFILE_REGIONS] = { {"_F4DIV", UL 0x0f0918, UL 0x0f0b27, 0}, {"_F4TOF8+_F8ADD", UL 0x0f0b28, UL 0x0f161b, 0}, {"_I4DIVU", UL 0x0f1c7e, UL 0x0f1ccb, 0}, {"_I4TOF4", UL 0x0f1ccc, UL 0x0f1ced, 0}, {"_LTOF", UL 0x0f1cee, UL 0x0f1deb, 0}, {"_F4RTOL", UL 0x0f22da, UL 0x0f2339, 0}, {"_F8DIV", UL 0x0f233a, UL 0x0f261d, 0}, {"_F8EQ", UL 0x0f261e, UL 0x0f2699, 0} };

  23. Step Three Results • Most time spent in double precision floating point divide • 3.5*(sin(i/f)+1.0) is culprit. Avoid floating point when possible

  24. Disadvantages of Sampling • Sampling is inexact - not guaranteed to get everything that runs • Code which disables interrupts (e.g. ISRs, OS code) is not measured • Rarely executed code may be missed • Takes time for numbers to settle down • Profile changes based on mode of program • If manually creating table, user needs to update address table with each code change

  25. Implementing Instrumentation • Tedious to do manually for large programs, so automate • Have compiler instrument code for you • gcc and other compilers support profiling using a command line switch • They provide a tool to process the output file to determine how much time each function takes • Can also modify the binary (after compilation): Atom from DEC’s Western Research Lab, Etch, EEL • Tool processes binary files to run your instrumentation procedures for each procedure, basic block, or instruction • For the M16C, what would be best? • Create a program which reads the map file and creates a C file declaring our profiles array with correct region names and addresses • Probably easiest to use a scripting language: sed, awk, perl, (f)lex • Probably not enough memory to instrument all functions, so must be selective

More Related