
A Fast On-Chip Profiler Memory

A Fast On-Chip Profiler Memory. Roman Lysecky, Susan Cotterell, Frank Vahid*. Department of Computer Science and Engineering, University of California, Riverside. *Also with the Center for Embedded Computer Systems, UC Irvine. This work was supported in part by the National Science Foundation.


Presentation Transcript


  1. A Fast On-Chip Profiler Memory Roman Lysecky, Susan Cotterell, Frank Vahid* Department of Computer Science and Engineering University of California, Riverside *Also with the Center for Embedded Computer Systems, UC Irvine This work was supported in part by the National Science Foundation

  2. Outline • Introduction • Problem Definition • Profiling Techniques • Pipelined Binary Search Tree • ProMem • Conclusions

  3. Introduction • Our Solution: Add On-Chip Profiler Memory (ProMem) to Monitored Bus • Goal: Determine # of Times Each Target Pattern Appears on the Bus • Accepts 1 pattern/cycle • Keeps Exact Counts [Diagram: ProMem monitoring the embedded bus alongside the processor, memory, I$, D$, bridge, and peripherals]

  4. Introduction • Example: profiling prog.c shows that most instructions executed come from small Loop A in compute() • Using this profile information, Loop A is moved to hardware: synthesis produces a circuit and the FPGA is configured with it [Diagram: prog.c with small Loops A and B, the per-loop instruction profile, and an SOC containing the processor, memory, peripherals, and FPGA]

  5. Introduction • Profiling Can Be Used to Solve Many Problems • Optimization of frequently executed subroutines • Mapping frequently executed code and data to non-interfering cache regions • Synthesis of optimized hardware for common cases • Identifying frequent loops to map to a small low-power loop cache • Many Others!

  6. Problem Definition • Given: input patterns P = {p1, …, pm} appearing on bus B, target patterns TP = {tp1, …, tpm}, and target pattern counts CTP = {ctp1, …, ctpm} • Objective • Count number of times each target pattern appears on bus B • Requirements • Accept input patterns on every clock cycle • Monitor any bus, e.g., deeply embedded buses in SOCs • Non-intrusive • Exact target pattern count [Diagram: processor, memory, and peripherals exchanging patterns p1, p2, …, pm on the bus; table mapping tp1 → 11203, tp2 → 8876, …, tpm → ctpm]

  7. Profiling Techniques - Software • Instrumenting Software • Adding code to count frequencies of desired code regions • Problems • Incurs runtime overhead • Possibly changes program behavior • Increase in code size [Diagram: prog.c instrumented with a counter increment (ctpm++) inside a loop, running on the processor/memory/peripheral system]
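A minimal sketch of the instrumentation approach described above; the counter name and the instrumented loop are hypothetical, not taken from the original prog.c:

```c
#include <stdio.h>

/* Hypothetical profile counter inserted by instrumentation. */
static unsigned long loop_a_count = 0;

void compute(void) {
    for (int i = 0; i < 1000; i++) {
        loop_a_count++;   /* added code: runtime overhead and larger code size */
        /* ... original loop body ... */
    }
}

int main(void) {
    compute();
    printf("Loop A executed %lu times\n", loop_a_count);
    return 0;
}
```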

  8. Profiling Techniques - Software • Periodic Sampling • Interrupt processor at periodic interval • Read program counter and other internal registers • Problems • Disruption of runtime behavior during interrupt • Inaccurate [Diagram: prog.c with an ISR (period = 10 ms) that updates profile information, running on the processor/memory/peripheral system]
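A sketch of how periodic sampling gathers profile data, assuming a timer-driven ISR; the program-counter values here are simulated, since reading the real PC is platform-specific:

```c
#include <stdio.h>

#define N_BUCKETS 16

/* Histogram of sampled PC values, one bucket per coarse code region. */
static unsigned long histogram[N_BUCKETS];

/* On real hardware this would run from a periodic timer interrupt. */
static void sampling_isr(unsigned int pc) {
    histogram[(pc >> 8) % N_BUCKETS]++;
}

int main(void) {
    /* Simulated PC values captured at each sampling tick. */
    unsigned int simulated_pcs[] = {0x100, 0x104, 0x900, 0x104, 0x910};
    for (size_t i = 0; i < sizeof simulated_pcs / sizeof simulated_pcs[0]; i++)
        sampling_isr(simulated_pcs[i]);
    for (int b = 0; b < N_BUCKETS; b++)
        if (histogram[b])
            printf("bucket %d: %lu samples\n", b, histogram[b]);
    return 0;
}
```

Because only the sampled cycles are seen, the resulting profile is approximate, which is the inaccuracy noted above.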

  9. Profiling Techniques - Software • Simulation • Execute application on instruction set simulator • Simulator keeps track of profile information • Problems • Difficult to model external environment, which leads to inaccuracy • Extremely slow [Diagram: prog.c fed to an ISS, which produces the profile information]
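A toy instruction-set-simulator loop illustrating how a simulator can keep profile information as a side effect of fetching each instruction; the three-opcode ISA and the program are invented for illustration:

```c
#include <stdio.h>

/* Tiny made-up ISA: decrement r0 and jump to 0 if nonzero, no-op, halt. */
enum { OP_DEC_JNZ, OP_NOP, OP_HALT };

int main(void) {
    int program[] = {OP_NOP, OP_DEC_JNZ, OP_HALT};
    unsigned long exec_count[3] = {0};
    int pc = 0, r0 = 5;                      /* r0: loop trip count */
    while (program[pc] != OP_HALT) {
        exec_count[pc]++;                    /* profile info kept by the simulator */
        if (program[pc] == OP_DEC_JNZ && --r0 != 0)
            pc = 0;                          /* taken branch back to the loop top  */
        else
            pc++;
    }
    for (int a = 0; a < 3; a++)
        printf("addr %d executed %lu times\n", a, exec_count[a]);
    return 0;
}
```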

  10. Profiling Techniques - Hardware • Logic Analyzer • Probes placed directly on bus to be monitored • Problems • Cannot monitor embedded buses [Diagram: processor, memory, and peripherals on a bus carrying patterns p1, p2, …, pm]

  11. Profiling Techniques - Hardware • Processor Support • Mainly event counters • Monitored events include cache misses, pipeline stalls, etc. • Problems • Few registers available • Reconfiguration needed to obtain a complete profile • Leads to inaccuracy [Diagram: processor, memory, and peripherals on a bus carrying patterns p1, p2, …, pm]

  12. Profiling Techniques - Hardware • Content-addressable memories (CAMs) • Fast search for a key in a large data set • Returns the address at which the key resides in a memory • Types • Fully Associative • RAM coupled with a smart controller [Diagram: CAM attached to the bus between the processor, memory, and peripherals]

  13. Profiling Techniques - Hardware • Fully Associative CAMs • Simultaneously compares every location with the key • Problems • Does not scale well to larger memories • Increased access time as CAM size grows • Large power consumption [Diagram: CAM on the bus, with a parallel comparator matching the input against each of tp1, tp2, tp3, …, tpm]
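A behavioral model of the fully associative lookup: the hardware compares every stored location with the key in parallel, which software can only approximate with a loop; the stored values are illustrative:

```c
#include <stdio.h>

#define CAM_SIZE 4

/* Returns the address at which the key resides, or -1 on a miss.
 * In real hardware all CAM_SIZE comparisons happen simultaneously. */
static int cam_lookup(const unsigned int *entries, unsigned int key) {
    for (int addr = 0; addr < CAM_SIZE; addr++)
        if (entries[addr] == key)
            return addr;
    return -1;
}

int main(void) {
    unsigned int target_patterns[CAM_SIZE] = {0x1000, 0x1040, 0x2000, 0x20F0};
    printf("0x2000 found at address %d\n", cam_lookup(target_patterns, 0x2000));
    printf("0x3000 found at address %d\n", cam_lookup(target_patterns, 0x3000));
    return 0;
}
```

The per-entry comparators are exactly what makes large fully associative CAMs slow and power-hungry.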

  14. Profiling Techniques - Hardware • RAM coupled with a smart controller • Efficient lookup data structure in memory, such as a binary tree or Patricia trie • Problems • Multiple-cycle lookup [Diagram: SRAM plus controller attached to the bus between the processor, memory, and peripherals]
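A sketch of the RAM-plus-controller approach, assuming the target patterns are kept sorted in an ordinary RAM: the controller performs a binary search, issuing one memory read per step, so each lookup takes multiple cycles:

```c
#include <stdio.h>

#define N 8

/* Binary search over a sorted RAM image; *reads counts the memory probes,
 * each of which costs at least one cycle in hardware. */
static int ram_lookup(const unsigned int *ram, unsigned int key, int *reads) {
    int lo = 0, hi = N - 1;
    *reads = 0;
    while (lo <= hi) {
        int mid = (lo + hi) / 2;
        (*reads)++;                              /* one RAM access per step */
        if (ram[mid] == key) return mid;
        if (ram[mid] < key) lo = mid + 1; else hi = mid - 1;
    }
    return -1;
}

int main(void) {
    unsigned int ram[N] = {2, 5, 9, 14, 21, 33, 40, 57};   /* illustrative values */
    int reads;
    int addr = ram_lookup(ram, 33, &reads);
    printf("found at %d after %d RAM reads (multi-cycle lookup)\n", addr, reads);
    return 0;
}
```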

  15. Observations • Not necessary to have a 1-cycle lookup • Only need to accept one input pattern every cycle

  16. Queueing • Hold input patterns in a queue until we are able to process them • Problems • Does not work with patterns arriving every clock cycle [Diagram: bus B feeding a FIFO in front of the CAM (SRAM plus controller)]
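A back-of-the-envelope model of why the FIFO cannot keep up: with one pattern arriving per cycle and each lookup taking several cycles (the 3-cycle figure is illustrative), the queue depth grows without bound:

```c
#include <stdio.h>

#define LOOKUP_CYCLES 3   /* illustrative multi-cycle lookup latency */

int main(void) {
    long queue_depth = 0;
    for (long cycle = 1; cycle <= 1000; cycle++) {
        queue_depth++;                        /* one new pattern every cycle */
        if (cycle % LOOKUP_CYCLES == 0)
            queue_depth--;                    /* one lookup completes        */
    }
    printf("queue depth after 1000 cycles: %ld\n", queue_depth);
    return 0;
}
```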

  17. Pipelining • Implemented in processors such that instructions can be executed every cycle • Can we use pipelining to solve our problem?

  18. Pipelined CAM • Large CAMs require long access times • Partition the large CAM into several smaller CAMs • Requires pipelining to reduce access time • Provides a solution to the access time problem • Requires large area • Large power consumption [Diagram: one large CAM partitioned into several smaller CAMs separated by pipeline registers]

  19. Pipelined CAM • Entries can be stored in a CAM in any order • Requires sequential lookup in the pipelined CAM approach • Is there a benefit to sorting the entries? • Not necessary to search all entries • Leads to faster lookup time • A tree structure provides an inherently sorted structure • Search time remains a problem • Can we pipeline the structure?

  20. Pipelined Tree • Solves access time problem • One memory access per level • Solves area problem • Single comparator per level • Each level grows by a factor of two • For large memories, comparators are negligible [Diagram: tree levels stored in separate memories, with one comparator (=) per level]
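An illustrative comparison (the sizes are not from the slides) of why one comparator per level is negligible: a pipelined tree with L levels holds up to 2^L - 1 patterns and needs L comparators, while a fully associative CAM needs one comparator per entry:

```c
#include <stdio.h>

/* Comparator count: pipelined tree (one per level) vs. fully associative CAM
 * (one per entry). Sizes chosen only to illustrate the scaling. */
int main(void) {
    for (int levels = 4; levels <= 12; levels += 4) {
        int entries = (1 << levels) - 1;
        printf("%5d entries: %2d comparators (pipelined tree) vs %5d (CAM)\n",
               entries, levels, entries);
    }
    return 0;
}
```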

  21. Pipelined Binary Search Tree • Root node at the top; each node has at most two children • Right child < Parent • Left child > Parent [Diagram: example tree with root h; level 1: j, d; level 2: k, i, f, b; level 3: g, e, c, a]

  22. Pipelined Binary Search Tree • Searching for input pattern f • Stage 0: f < h, go right • Stage 1: f > d, go left • Stage 2: f = f, found! [Diagram: the search path h → d → f through the four-stage tree]

  23. Pipelined Binary Search Tree • Searching for input pattern e • Stage 0: e < h, append 0 to address • Stage 1: e > d, append 1 to address (address = 01) • Stage 2: e < f, append 0 to address (address = 010) • Stage 3: e = e, found at address 010! [Diagram: the address bits accumulated at each stage as the search descends the tree]

  24. Pipelined Binary Search Tree • Searching for input patterns f and e, entering the pipeline one cycle apart • f: f < h, append 0 to address; f > d, append 1; f = f, found at stage 2, address 01! • e: e < h, append 0 to address; e > d, append 1; e < f, append 0; e = e, found at stage 3, address 010! [Diagram: both searches in flight at the same time, each occupying a different pipeline stage]

  25. Pipelined Binary Search Tree • Standard memories • Stage 0 holds h • Stage 1 holds d, j at addresses 0, 1 • Stage 2 holds b, f, i, k at addresses 00, 01, 10, 11 • Stage 3 holds a, c, e, g at addresses 000, 001, 010, 011, with the remaining entries unused [Diagram: each tree level mapped to an ordinary memory indexed by the accumulated address bits]
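A behavioral sketch of the search walk in slides 21-25, assuming each tree level is an ordinary memory of 2^s entries indexed by the address bits accumulated so far (0 appended when the key is smaller than the stored node, 1 when it is larger); the level contents mirror the slides' example tree:

```c
#include <stdio.h>

#define LEVELS 4

/* Level s of the tree stored as a plain 2^s-entry memory.
 * Contents follow the example tree: root h; d, j; b, f, i, k; a, c, e, g. */
static const char level0[1] = {'h'};
static const char level1[2] = {'d', 'j'};
static const char level2[4] = {'b', 'f', 'i', 'k'};
static const char level3[8] = {'a', 'c', 'e', 'g', 0, 0, 0, 0};   /* rest unused */
static const char *levels[LEVELS] = {level0, level1, level2, level3};

/* Returns the address at which the key was found, or -1 if it is not stored. */
static int bst_lookup(char key) {
    unsigned int addr = 0;                    /* address built one bit per level */
    for (int s = 0; s < LEVELS; s++) {
        char node = levels[s][addr];
        if (key == node)
            return (int)addr;                 /* found at this level              */
        addr = (addr << 1) | (key > node);    /* append 0 if smaller, 1 if larger */
    }
    return -1;                                /* key is not a target pattern      */
}

int main(void) {
    printf("'f' found at address %d\n", bst_lookup('f'));   /* expect 1  (binary 01)  */
    printf("'e' found at address %d\n", bst_lookup('e'));   /* expect 2  (binary 010) */
    printf("'z' found at address %d\n", bst_lookup('z'));   /* expect -1 (not stored) */
    return 0;
}
```

In hardware, each loop iteration corresponds to one pipeline stage, so a new search can enter the tree every cycle while earlier searches are still in flight.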

  26. ProMem – Module Design • Each ProMem stage s latches its inputs in pipeline registers: the input pattern (ps_i), the search address (As_i), and the enable (cen_i) • The stage drives the input pattern (ps_o), search address (As+1_o), and enable (cen_o) of the next stage [Diagram: pipeline registers ps, As, and cen at the input of ProMem stage s]

  27. ProMem – Module Design • Stage s contains a target pattern memory TPMs of size 2^s × w • The TPMs read port is addressed by the search address held in the pipeline register [Diagram: TPMs (2^s × w) with rd, addr, and dout ports inside ProMem stage s]

  28. ProMem – Module Design • A comparator checks the registered input pattern against the TPMs output to search for the target pattern • Equal: target pattern found • Not equal: target pattern not found, so the next stage is enabled [Diagram: comparator (>, =) between the pipeline registers and TPMs dout inside ProMem stage s]

  29. ProMem – Module Design • Stage s also contains a target pattern count memory CMs of size 2^s × c, read and written at the same search address [Diagram: CMs (2^s × c) with rd, addr, wr, and dout ports alongside TPMs and the comparator inside ProMem stage s]

  30. ProMem – Module Design • When the target pattern is found, the count value is updated: the stored count is read from CMs, incremented by 1, and written back [Diagram: +1 incrementer between the CMs dout and its write port inside ProMem stage s]

  31. ProMem – Module Design • A complete stage consists of the pipeline register, the TPMs and CMs memories, and a module controller [Diagram: pipeline register, memories, and module controller composing ProMem stage s]
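A cycle-level sketch of one ProMem stage built from the pieces above: pipeline registers, a target pattern memory TPMs, a comparator, and a count memory CMs. The three-stage tree contents and pattern values are illustrative, and for brevity each pattern is pushed through the stages one at a time, whereas the hardware overlaps one pattern per stage:

```c
#include <stdio.h>
#include <stdbool.h>

#define STAGES 3
#define MAX_ENTRIES (1u << (STAGES - 1))   /* stage s uses only 2^s entries */

typedef struct {
    bool     cen;     /* enable from previous stage        */
    unsigned p;       /* input pattern being searched      */
    unsigned addr;    /* search address accumulated so far */
} stage_regs_t;

typedef struct {
    unsigned      tpm[MAX_ENTRIES];   /* target pattern memory (2^s x w)       */
    unsigned long cm[MAX_ENTRIES];    /* target pattern count memory (2^s x c) */
} stage_mem_t;

static stage_mem_t mems[STAGES] = {
    { .tpm = {50} },                  /* stage 0: root of the tree */
    { .tpm = {20, 80} },              /* stage 1                   */
    { .tpm = {10, 30, 70, 90} },      /* stage 2                   */
};

/* One clock of stage s: read TPMs, compare, and either update CMs (found)
 * or enable stage s+1 with the extended search address (not found). */
static stage_regs_t stage_step(int s, stage_regs_t in) {
    stage_regs_t out = { .cen = false, .p = in.p, .addr = 0 };
    if (!in.cen)
        return out;                                /* stage idle this cycle  */
    unsigned stored = mems[s].tpm[in.addr];        /* TPMs read              */
    if (in.p == stored) {
        mems[s].cm[in.addr]++;                     /* found: +1 on the count */
    } else {
        out.cen  = true;                           /* enable next stage      */
        out.addr = (in.addr << 1) | (in.p > stored);
    }
    return out;
}

int main(void) {
    unsigned patterns[] = {30, 70, 30, 55};        /* 55 is not a target */
    for (size_t i = 0; i < sizeof patterns / sizeof patterns[0]; i++) {
        stage_regs_t r = { .cen = true, .p = patterns[i], .addr = 0 };
        for (int s = 0; s < STAGES && r.cen; s++)  /* hardware overlaps these */
            r = stage_step(s, r);
    }
    printf("count of 30: %lu, count of 70: %lu\n", mems[2].cm[1], mems[2].cm[2]);
    return 0;
}
```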

  32. ProMem - Interface • Simple Interface • Internal interface • Enable signal • Connection to monitored bus • External interface • Read enable • Write enable • Connection to ProMem pattern input bus [Diagram: ProMem attached to the monitored bus via clk, cen, addr, ren, and wen signals]

  33. ProMem - Layout • Efficient Layout • Achieved by simply abutting each module with the next • Results in very short bus wires between each module [Diagram: the full stage s module, with pipeline registers, TPMs, CMs, comparator, and +1 incrementer]

  34. ProMem Results – Area* • Module overhead is only 1% • *Area obtained using the UMC .18 technology library provided by Artisan Components

  35. ProMem Results – vs. CAM • The CAM design is 46% larger than ProMem

  36. ProMem Results – Timing vs. CAM • CAM access time grows with CAM size • ProMem access time remains constant (due to pipelining)

  37. Conclusions • Introduced a new memory structure specifically for fast on-chip profiling • One pattern per cycle throughput • Simple interface to monitored bus • Efficient design is very scalable
