This research paper discusses adding an on-chip profiler memory to improve embedded bus monitoring efficiency. It explores techniques like CAMs and Pipelined Binary Search Trees to enhance the profiling process on monitored buses. By maintaining exact counts of target patterns, this solution aids in various optimizations and mappings. The study investigates software and hardware profiling techniques, along with the challenges faced and potential solutions for accurate and non-intrusive monitoring.
A Fast On-Chip Profiler Memory
Roman Lysecky, Susan Cotterell, Frank Vahid*
Department of Computer Science and Engineering, University of California, Riverside
*Also with the Center for Embedded Computer Systems, UC Irvine
This work was supported in part by the National Science Foundation
Outline
• Introduction
• Problem Definition
• Profiling Techniques
• Pipelined Binary Search Tree
• ProMem
• Conclusions
Introduction
• Monitor Embedded Bus
• Goal: Determine the number of times each target pattern appears on the bus
• Our Solution: Add an on-chip profiler memory (ProMem) to the monitored bus
  • Accepts 1 pattern/cycle
  • Keeps exact counts
[Figure: system-on-chip with processor, I$, D$, memory, bridge, and peripherals; ProMem is attached to the monitored bus]
Introduction
[Figure: prog.c is profiled to find where most instructions are executed; the profile information lists Loop A … Loop N; Loop A is moved to hardware through synthesis and FPGA configuration, next to the processor, memory, and peripherals]
prog.c:
  void compute() {
    // small Loop A
    for(i=0;…;…) …
    // small Loop B
    for(x=0;…;…)
  }
Introduction
• Profiling Can Be Used to Solve Many Problems
  • Optimization of frequently executed subroutines
  • Mapping frequently executed code and data to non-interfering cache regions
  • Synthesis of optimized hardware for common cases
  • Identifying frequent loops to map to a small low-power loop cache
  • Many others!
Problem Definition
• Input patterns P = {p1, …, pm} appear on bus B
• Target patterns TP = {tp1, …, tpm}
• Target pattern counts CTP = {ctp1, …, ctpm}
• Objective
  • Count the number of times each target pattern appears on bus B
• Requirements
  • Accept input patterns on every clock cycle
  • Monitor any bus, e.g., deeply embedded buses in SOCs
  • Non-intrusive
  • Exact target pattern count
TP     CTP
tp1    11203
tp2    8876
…      …
tpm    ctpm
[Figure: processor, memory, and peripherals on bus B, which carries input patterns p1, p2, …, pm]
Profiling Techniques - Software
• Instrumenting Software
  • Adding code to count frequencies of desired code regions
• Problems
  • Incurs runtime overhead
  • Possibly changes program behavior
  • Increases code size
prog.c: for( … ){ … ctpm++; }
[Figure: processor, memory, and peripherals on the bus carrying input patterns p1, p2, …, pm]
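The instrumentation idea above can be sketched in C. This is a minimal illustration, not code from the slides: the region names, the `compute_instrumented` routine, and its placeholder loop bodies are all hypothetical. It shows where the added counter increments sit and why they cost extra instructions on every iteration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical region identifiers; the slides only name "Loop A" and "Loop B". */
enum { REGION_LOOP_A, REGION_LOOP_B, NUM_REGIONS };

static uint64_t region_count[NUM_REGIONS];  /* zero-initialized profile counters */

/* Two small loops with one counter increment added to each loop body.
 * The loop bodies are placeholders; only the added increments matter here. */
uint32_t compute_instrumented(uint32_t iters_a, uint32_t iters_b) {
    uint32_t acc = 0;
    for (uint32_t i = 0; i < iters_a; i++) {
        region_count[REGION_LOOP_A]++;  /* instrumentation: extra work per iteration */
        acc += i;
    }
    for (uint32_t x = 0; x < iters_b; x++) {
        region_count[REGION_LOOP_B]++;  /* instrumentation */
        acc ^= x;
    }
    return acc;
}

/* Read back the execution count of one region. */
uint64_t profile_count(int region) {
    assert(region >= 0 && region < NUM_REGIONS);
    return region_count[region];
}
```

Calling `compute_instrumented(5, 3)` and then reading `profile_count(REGION_LOOP_A)` gives the loop's iteration count, at the price of a larger and slower binary than the uninstrumented routine.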
Profiling Techniques - Software
• Periodic Sampling
  • Interrupt processor at periodic interval
  • Read program counter and other internal registers
• Problems
  • Disruption of runtime behavior during interrupt
  • Inaccurate
prog.c: // ISR period = 10ms ISR{ //update profile info }
[Figure: processor, memory, and peripherals on the bus carrying input patterns p1, p2, …, pm]
Profiling Techniques - Software
• Simulation
  • Execute application on instruction set simulator (ISS)
  • Simulator keeps track of profile information
• Problems
  • Difficult to model the external environment, which leads to inaccuracy
  • Extremely slow
[Figure: prog.c runs on the ISS, which produces the profile information]
Profiling Techniques - Hardware
• Logic Analyzer
  • Probes placed directly on bus to be monitored
• Problems
  • Cannot monitor embedded buses
[Figure: processor, memory, and peripherals on the bus carrying input patterns p1, p2, …, pm]
Profiling Techniques - Hardware
• Processor Support
  • Mainly event counters
  • Monitored events include cache misses, pipeline stalls, etc.
• Problems
  • Few registers available
  • Reconfiguration needed to obtain a complete profile
    • Leads to inaccuracy
[Figure: processor, memory, and peripherals on the bus carrying input patterns p1, p2, …, pm]
Profiling Techniques - Hardware
• Content-addressable memories (CAMs)
  • Fast search for a key in a large data set
  • Returns the address at which the key resides in a memory
• Types
  • Fully Associative
  • RAM coupled with a smart controller
[Figure: CAM attached to the bus carrying input patterns p1, p2, …, pm, alongside processor, memory, and peripherals]
Profiling Techniques - Hardware
• Fully Associative CAMs
  • Simultaneously compares every location with the key
• Problems
  • Does not scale well to larger memories
  • Increased access time as CAM size grows
  • Large power consumption
[Figure: CAM on the monitored bus with one comparator (=) per entry tp1, tp2, tp3, …, tpm]
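A behavioral sketch of the fully associative lookup may help here. It is a software model written for illustration, not the hardware design from the slides: `cam_lookup`, `cam_profile`, and `CAM_MISS` are hypothetical names, and the loop stands in for comparators that hardware evaluates in parallel within one cycle.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CAM_MISS ((size_t)-1)

/* Behavioral model of a fully associative lookup. Hardware compares the key
 * with every entry in the same cycle; this loop is the sequential equivalent. */
size_t cam_lookup(const uint32_t *entries, size_t n, uint32_t key) {
    assert(entries != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (entries[i] == key) return i;  /* one comparator per entry */
    }
    return CAM_MISS;
}

/* Count how often each target pattern tp[i] appears in a recorded bus trace. */
void cam_profile(const uint32_t *tp, uint64_t *ctp, size_t m,
                 const uint32_t *bus, size_t bus_len) {
    for (size_t k = 0; k < bus_len; k++) {
        size_t hit = cam_lookup(tp, m, bus[k]);
        if (hit != CAM_MISS) ctp[hit]++;
    }
}
```

The model makes the scaling problem visible: the number of comparisons per lookup equals the number of entries, which in hardware becomes one comparator per entry.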
Profiling Techniques - Hardware
• RAM coupled with a smart controller
  • Efficient lookup data structure in memory such as a binary tree or Patricia trie
• Problems
  • Multiple-cycle lookup
[Figure: CAM built from an SRAM and a controller (Ctrl), attached to the bus carrying input patterns p1, p2, …, pm]
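To make the multiple-cycle cost concrete, here is a small sketch that assumes the controller runs a binary search over a sorted SRAM image and spends one cycle per memory read. Both assumptions are mine; the slides name binary trees and Patricia tries only as examples, and `sram_lookup` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Binary search over a sorted SRAM image.
 * Returns the index of key or -1; *accesses receives the number of memory
 * reads, which under the one-read-per-cycle assumption is the lookup latency. */
int sram_lookup(const uint32_t *sram, size_t n, uint32_t key, unsigned *accesses) {
    assert(accesses != NULL);
    size_t lo = 0, hi = n;  /* search the half-open range [lo, hi) */
    *accesses = 0;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        (*accesses)++;      /* one SRAM read */
        if (sram[mid] == key) return (int)mid;
        if (sram[mid] < key) lo = mid + 1;
        else hi = mid;
    }
    return -1;
}
```

For a 7-entry memory a lookup can take up to 3 reads, so a pattern arriving on every clock cycle cannot be serviced by a single such controller.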
Observations
• Not necessary to have a 1-cycle lookup
• Only need to accept one input pattern every cycle
Queueing
• Hold input patterns in queue until we are able to process them
• Problems
  • Does not work with patterns arriving every clock cycle
[Figure: bus B feeds a FIFO placed in front of the CAM (SRAM plus controller)]
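The queueing problem can be illustrated with a small simulation. The model is a simplification of my own: one pattern arrives per cycle, the lookup unit removes one queued pattern every `lookup_cycles` cycles, and `first_overflow_cycle` is a hypothetical helper. Under these assumptions any finite FIFO overflows once a lookup takes more than one cycle.

```c
#include <assert.h>
#include <stdint.h>

/* Returns the first cycle at which an arriving pattern finds the FIFO full,
 * or 0 if no overflow occurs within max_cycles. */
uint32_t first_overflow_cycle(uint32_t depth, uint32_t lookup_cycles,
                              uint32_t max_cycles) {
    assert(lookup_cycles > 0);
    uint32_t occupancy = 0;
    for (uint32_t cycle = 1; cycle <= max_cycles; cycle++) {
        if (cycle % lookup_cycles == 0 && occupancy > 0) {
            occupancy--;            /* a lookup completes and frees one slot */
        }
        if (occupancy == depth) {
            return cycle;           /* no room for this cycle's pattern */
        }
        occupancy++;                /* this cycle's pattern is enqueued */
    }
    return 0;
}
```

With a 4-entry FIFO and a 4-cycle lookup the queue overflows at cycle 6, while a 1-cycle lookup never overflows. A deeper FIFO only delays the overflow, which is why the slides move on to pipelining.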
Pipelining
• Implemented in processors such that instructions can be executed every cycle
• Can we use pipelining to solve our problem?
Pipelined CAM
• Large CAMs require long access times
• Partition large CAM into several smaller CAMs
  • Requires pipelining to reduce access time
• Provides solution to access time problem
• Requires large area
• Large power consumption
[Figure: one large CAM partitioned into four smaller CAMs, each followed by a pipeline register]
Pipelined CAM
• Entries can be stored in a CAM in any order
  • Requires sequential lookup in pipelined CAM approach
• Is there a benefit to sorting the entries?
  • Not necessary to search all entries
  • Leads to faster lookup time
• Tree structure provides an inherently sorted structure
  • Search time remains a problem
• Can we pipeline the structure?
Pipelined Tree
• Solves access time problem
  • One memory access per level
• Solves area problem
  • Single comparator per level
  • Each level grows by factor of two
  • For large memories, comparators are negligible
[Figure: tree levels, each with a single comparator (=)]
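Pipelined Binary Search Tree
• Each node has at most two children
• Right child < Parent
• Left child > Parent
[Figure: binary search tree drawn with larger keys on the left. Root node: h; level 1: j, d; level 2: k, i, f, b; level 3: g, e (children of f), c, a (children of b)]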
Pipelined Binary Search Tree
Searching for Input Pattern: f
• Stage 0: f < h, go right
• Stage 1: f > d, go left
• Stage 2: f = f, Found!
[Figure: tree with stage 0: h; stage 1: j, d; stage 2: k, i, f, b; stage 3: g, e, c, a; the search path h, d, f is highlighted]
Pipelined Binary Search Tree
Searching for Input Pattern: e
• Stage 0: e < h, append 0 to address (address becomes 0)
• Stage 1: e > d, append 1 to address (address becomes 01)
• Stage 2: e < f, append 0 to address (address becomes 010)
• Stage 3: e = e, Found!
[Figure: tree with stage 0: h; stage 1: j, d; stage 2: k, i, f, b; stage 3: g, e, c, a; each stage is labeled with the addresses it can pass to the next stage]
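The address computation in this example can be written as a short C sketch. The stage contents are taken from the example tree on the slides; the use of `0` as an empty-slot marker, the array names, and the function `tree_search` are assumptions made for illustration. Each stage reads one entry at the current address, compares it, and appends 0 (smaller) or 1 (larger) to form the address for the next stage.

```c
#include <assert.h>
#include <stddef.h>

#define STAGES 4
#define EMPTY 0  /* assumption: 0 marks an unused slot */

/* Stage memories of the example tree, indexed by search address. */
static const char stage0[1] = {'h'};
static const char stage1[2] = {'d', 'j'};            /* 0: d, 1: j */
static const char stage2[4] = {'b', 'f', 'i', 'k'};  /* 00: b, 01: f, 10: i, 11: k */
static const char stage3[8] = {'a', 'c', 'e', 'g', EMPTY, EMPTY, EMPTY, EMPTY};
static const char *const tpm[STAGES] = {stage0, stage1, stage2, stage3};

/* Returns the stage in which pattern p is found and stores its address in
 * *addr_out, or returns -1 if p is not a target pattern. */
int tree_search(char p, unsigned *addr_out) {
    assert(addr_out != NULL);
    unsigned addr = 0;
    for (int s = 0; s < STAGES; s++) {
        char node = tpm[s][addr];
        if (node == EMPTY) return -1;
        if (p == node) {
            *addr_out = addr;
            return s;
        }
        /* append 0 if p is smaller than the node, 1 if it is larger */
        addr = (addr << 1) | (p > node ? 1u : 0u);
    }
    return -1;
}
```

Searching for `'e'` ends in stage 3 at address 010 (decimal 2), matching the walk on the slide; searching for `'f'` ends in stage 2 at address 01.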
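Pipelined Binary Search Tree
Searching for Input Pattern: e, f
• Stage 0: e < h, append 0 to address; then f < h, append 0 to address
• Stage 1: e > d, append 1 to address; then f > d, append 1 to address
• Stage 2: e < f, append 0 to address; then f = f, Found!
• Stage 3: e = e, Found!
• While one stage works on e, the stage before it already works on f, so a new pattern can enter the tree every cycle
[Figure: the example tree with both search paths; e is found at address 010 in stage 3 and f at address 01 in stage 2]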
Pipelined Binary Search Tree
• Each stage is implemented with a standard memory indexed by the search address
Stage    Contents (address: pattern)
0        h
1        1: j, 0: d
2        11: k, 10: i, 01: f, 00: b
3        011: g, 010: e, 001: c, 000: a (remaining entries unused)
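The slides do not say how the stage memories are filled. One plausible way, sketched here as an assumption, is to place the sorted target patterns into a complete binary search tree using heap indexing: node `i` has its smaller child at `2i` and its larger child at `2i + 1`, and stage `s` holds nodes `2^s` through `2^(s+1) - 1`. The names `build_complete_bst` and `stage_entry` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* In-order fill of heap[1..n] from a sorted list, which yields a complete
 * binary search tree in heap order. */
static void fill(const char *sorted, size_t n, char *heap, size_t i, size_t *next) {
    if (i > n) return;
    fill(sorted, n, heap, 2 * i, next);      /* smaller keys: address bit 0 */
    heap[i] = sorted[(*next)++];
    fill(sorted, n, heap, 2 * i + 1, next);  /* larger keys: address bit 1 */
}

/* heap must have room for n + 1 entries; heap[0] is unused. */
void build_complete_bst(const char *sorted, size_t n, char *heap) {
    assert(sorted != NULL && heap != NULL);
    size_t next = 0;
    heap[0] = 0;
    fill(sorted, n, heap, 1, &next);
}

/* Entry stored at address a of stage s, or 0 if that slot is unused. */
char stage_entry(const char *heap, size_t n, unsigned s, unsigned a) {
    size_t i = ((size_t)1 << s) + a;
    return i <= n ? heap[i] : 0;
}
```

Applied to the eleven patterns `a` through `k`, this construction reproduces the contents shown in the example: `h` in stage 0, `d` and `j` in stage 1, and `e` at address 010 of stage 3.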
ProMem – Module Design
• Inputs of ProMem stage s: input pattern (ps_i), search address (As_i), enable (cen_i)
• Pipeline registers hold ps, As, and cen
• Outputs to the next stage: input pattern (ps_o), search address (As+1_o), enable (cen_o)
ProMem – Module Design
• Target Pattern Memory TPM_s (2^s × w), with rd, addr, and dout ports
[Figure: ProMem stage s with inputs ps_i, As_i, cen_i, pipeline registers, and outputs ps_o, As+1_o, cen_o]
ProMem – Module Design
• Search for Target Pattern: compare (>, =) the input pattern with the output of TPM_s (2^s × w)
• Target Pattern Found
• Target Pattern Not Found – Enable Next Stage
[Figure: ProMem stage s with inputs ps_i, As_i, cen_i, pipeline registers, target pattern memory, comparator, and outputs ps_o, As+1_o, cen_o]
ProMem – Module Design
• Target Pattern Count Memory CM_s (2^s × c), with rd, wr, addr, and dout ports
[Figure: ProMem stage s with inputs ps_i, As_i, cen_i, pipeline registers, TPM_s (2^s × w), comparator (>, =), and outputs ps_o, As+1_o, cen_o]
ProMem – Module Design
• When Target Pattern Found – Update Count Value (+1)
[Figure: ProMem stage s with inputs ps_i, As_i, cen_i, pipeline registers, TPM_s (2^s × w), CM_s (2^s × c), comparator (>, =), incrementer (+1), and outputs ps_o, As+1_o, cen_o]
ProMem – Module Design
• Each module consists of a pipeline register, memories, and a module controller
[Figure: ProMem stage s with inputs ps_i, As_i, cen_i, pipeline registers, TPM_s (2^s × w), CM_s (2^s × c), comparator (>, =), incrementer (+1), module controller, and outputs ps_o, As+1_o, cen_o]
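The module design can be approximated by a cycle-level C model. This is a simplified sketch, not the authors' implementation: it assumes four stages loaded with the example tree, treats `0` as an empty slot, and lets each stage read, compare, and update its count within a single clock. The type and function names (`PipeReg`, `ProMemModel`, `promem_init`, `promem_clock`) are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define STAGES 4
#define EMPTY 0  /* assumption: 0 marks an unused target pattern slot */

typedef struct { char p; unsigned addr; int en; } PipeReg;  /* ps, As, cen */

typedef struct {
    char     tpm[STAGES][8];  /* target pattern memories; stage s uses 2^s entries */
    uint32_t cm[STAGES][8];   /* count memories, same addressing as tpm */
    PipeReg  reg[STAGES];     /* input pipeline register of each stage */
} ProMemModel;

/* Load the example tree from the slides and clear counts and registers. */
void promem_init(ProMemModel *m) {
    assert(m != NULL);
    memset(m, 0, sizeof *m);
    memcpy(m->tpm[0], "h", 1);
    memcpy(m->tpm[1], "dj", 2);
    memcpy(m->tpm[2], "bfik", 4);
    memcpy(m->tpm[3], "aceg", 4);
}

/* One clock cycle: every stage processes its own register contents, then all
 * registers advance together. A new pattern is accepted on every call. */
void promem_clock(ProMemModel *m, char p_in, int en_in) {
    assert(m != NULL);
    PipeReg next[STAGES] = {{0}};
    next[0] = (PipeReg){p_in, 0u, en_in};
    for (int s = 0; s < STAGES; s++) {
        PipeReg r = m->reg[s];
        if (!r.en) continue;                  /* stage idle this cycle */
        char node = m->tpm[s][r.addr];
        if (node == r.p) {                    /* target pattern found */
            m->cm[s][r.addr]++;               /* update count value (+1) */
            continue;                         /* next stage stays disabled */
        }
        if (node == EMPTY || s + 1 == STAGES) continue;  /* not a target pattern */
        /* not found: enable next stage with the extended search address */
        next[s + 1] = (PipeReg){r.p, (r.addr << 1) | (r.p > node ? 1u : 0u), 1};
    }
    memcpy(m->reg, next, sizeof next);
}
```

Feeding the patterns `e`, `f`, `e` on three consecutive clocks and then clocking four more times with the enable low drains the pipeline; the count for `e` (stage 3, address 010) is then 2 and the count for `f` (stage 2, address 01) is 1. The model keeps the one-pattern-per-cycle property because no stage ever waits for another.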
ProMem - Interface
• Simple Interface
• Internal interface
  • Enable signal
  • Connection to monitored bus
• External interface
  • Read enable
  • Write enable
  • Connection to ProMem pattern input bus
[Figure: ProMem attached to the monitored bus carrying p1, p2, …, pm, with signals clk, cen, addr, ren, and wen]
ProMem - Layout
• Efficient Layout
  • Achieved by simply abutting each module with the next
  • Results in very short bus wires between each module
[Figure: ProMem stage s module (pipeline registers, TPM_s (2^s × w), CM_s (2^s × c), comparator, incrementer) with inputs ps_i, As_i, cen_i and outputs ps_o, As+1_o, cen_o]
ProMem Results – Area*
• Module overhead only 1%
*Area obtained using UMC 0.18 µm technology library provided by Artisan Components
ProMem Results – vs. CAM
• CAM design is 46% larger than ProMem
ProMem Results – Timing vs. CAM
• CAM access time grows with CAM size
• ProMem access time remains constant (due to pipelining)
Conclusions
• Introduced a new memory structure specifically for fast on-chip profiling
• One pattern per cycle throughput
• Simple interface to monitored bus
• Efficient design is very scalable