
Thesis idea evaluation - Automatic configuration of ASIP cores


Presentation Transcript


  1. Thesis idea evaluation - Automatic configuration of ASIP cores by Shobana Padmanabhan, June 23, 2004

  2. Introduction • ASIP – a (parameterized) embedded soft core • In between custom and general-purpose designs • E.g. ArcCores, HP, Tensilica, LEON • Advantages • Better application performance than a generic processor • Reuse of existing components • Lower cost compared to custom processors • Goal is to find the configuration with the minimum runtime

  7. Methodology considerations • Customize per application domain, not per app • Base architecture + customizations • Customizations • Increased number of functional units, registers, parallel memory accesses, pipeline depth, possibly new instructions, … • Avoid exhaustive simulation • The number of configurations is exponential • Simulating large data sets would be prohibitively time-consuming • Constraints • FPGA – limited area (cost, power constraints) • Architectural parameters are not independent

  8. Methodology considerations • Evaluation of proposed methodology • Compare the resulting configuration and runtime with hand-optimized configuration of benchmarks

  9. Approach 1 - Compiler directed • Compiler-directed customization of ASIP cores • by Gupta (UMD), Ko (Cornell), Barua (UMD) • for the methodology • Processor evaluation in an embedded systems design environment • by Gupta, Sharma, Balakrishna (IIT Delhi), Malik (Princeton) • for details of the processor description language and architectural parameters • Predicting performance potential of modern DSPs / Retargetable estimation scheme for DSP architecture selection • by Ghazal, Newton, Rabaey (UC Berkeley) • use more advanced processor features and compiler optimizations

  10. Methodology – basic idea • Start with a basic architecture • Estimate application performance • Now, vary the architecture (within the chip-area budget) and find the best runtime • To avoid (exhaustive) simulation • Estimate the runtime for a given configuration • Use a profiler • When the configuration changes, re-compile rather than re-simulate • Change the configuration, check the area, and infer the new runtime • Using statistical data on the inter-dependence of parameters

  11. Approach (block diagram, built up incrementally over slides 11–17; components in order of introduction): App → Profiler → Retargetable performance estimator; the base architecture + space of proposed parameters and the area estimates & budget feed an Architecture exploration engine, which works with the performance estimator to produce the optimal architectural parameters

  22. Performance estimator • runtime = sum over basic blocks of (profile-collected execution frequency of the block) × (scheduler-predicted runtime of that block) • Basic blocks identified by • converting to an internal format (Stanford University IF, which provides libraries to extract such info) • Execution frequency of each basic block obtained by • a compiler-inserted instruction that increments a global counter per basic block • Number of clock cycles • A scheduler schedules each basic block to derive its execution time on the processor (taking into account all parameters) • A processor description is needed for this, and a description language was developed (a context-free grammar) • The scheduler combines this time with the basic-block frequencies to estimate overall runtime
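
The estimate on this slide — per-block frequencies from the profiler weighted by per-block cycle predictions from the scheduler — can be sketched as follows (a minimal illustration; the function and block names are invented, not from the paper):

```python
# Minimal sketch of the runtime estimate: the profiler supplies per-basic-block
# execution frequencies, the scheduler supplies a predicted cycle count per
# block, and the estimated runtime is the frequency-weighted sum over blocks.

def estimate_runtime(block_freqs, block_cycles):
    """Estimated runtime in cycles: sum over blocks of freq * cycles."""
    return sum(block_freqs[b] * block_cycles[b] for b in block_freqs)

# Toy program with three basic blocks; the loop body executes 100 times.
freqs = {"entry": 1, "loop_body": 100, "exit": 1}
cycles = {"entry": 4, "loop_body": 12, "exit": 3}  # scheduler predictions
print(estimate_runtime(freqs, cycles))  # 1*4 + 100*12 + 1*3 = 1207
```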

  23. Performance estimation, more formally • Derive a runtime vs. parameter curve for each parameter (just recompile for every parameter value) • Runtime = (profile-collected basic block frequencies) × (scheduler-predicted runtime of that block) • Runtime_function(pi) = (runtime for pi) / (base runtime)

  25. Area estimation, formally • Obtain area vs. parameter curve for every parameter • Area_function(pi) = additional gate area for pi

  29. Retargetable performance estimator • Profiler • Computes the execution frequency of each basic block • A compiler-inserted instruction increments a global counter for this • Data flow graph builder, for scheduling • Directed acyclic graph per basic block – captures all dependencies (blocks execute in sequence; operations within a block may execute in parallel) • Priority of an operation is based on its height in the dependency graph • Fine-grain scheduler estimates the # of clock cycles, taking into account the different architecture parameters • Schedules each basic block to derive its execution time on the processor • Combines this with the block frequencies to estimate overall runtime • List scheduling: a greedy method that picks the next instruction from the DAG in priority order (operations on longer critical paths get higher priority)
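
The list scheduler described here can be sketched as below — a hedged illustration, not the paper's implementation: priorities are DAG heights, every operation takes one cycle, and a single generic pool of functional units is assumed.

```python
# Greedy list scheduling of a basic block's DAG: each op's priority is its
# height (longest path to a sink); every cycle, issue the highest-priority
# ready ops onto the available functional units.

def height(dag, op, memo=None):
    """Longest path (in ops) from `op` to any sink of the DAG."""
    if memo is None:
        memo = {}
    if op not in memo:
        succs = dag.get(op, [])
        memo[op] = 1 + max((height(dag, s, memo) for s in succs), default=0)
    return memo[op]

def list_schedule(dag, num_units):
    """Return the number of cycles to execute the block on `num_units` FUs."""
    preds = {op: 0 for op in dag}
    for succs in dag.values():
        for s in succs:
            preds[s] += 1
    ready = [op for op, n in preds.items() if n == 0]
    cycles = 0
    while ready:
        # Issue up to num_units ready ops, highest priority (height) first.
        ready.sort(key=lambda op: -height(dag, op))
        issued, ready = ready[:num_units], ready[num_units:]
        for op in issued:
            for s in dag.get(op, []):
                preds[s] -= 1
                if preds[s] == 0:
                    ready.append(s)
        cycles += 1
    return cycles

# Diamond-shaped DAG: a -> b, c; b, c -> d.
dag = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
print(list_schedule(dag, num_units=2))  # a | b,c | d -> 3 cycles
```

With a single functional unit the same block needs 4 cycles, which is the kind of sensitivity to the functional-unit count that the estimator exploits.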

  30. Retargetable performance estimator • Assumptions • All operations operate on operands in registers • Address computation for an array instruction is carried out by inserting explicit address-computation instructions

  31. The processor description language • Can express most embedded VLIW processors • Functional units in the data path, with their operations, corresponding latencies, and delays • Constraints in terms of operation slots & slot restrictions • Number of registers, write buses, and memory ports • Delay of branch operations • Concurrent load/store operations • Final operation delay = (delay of functional unit) × (delay of operation)
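
The language itself is defined by a context-free grammar in the paper; purely as an illustration of the kind of information it captures, the same facts can be held in a plain data structure (all field names and numbers below are invented):

```python
# Hypothetical processor description: functional units with per-operation
# delays, plus the resource constraints the scheduler needs.

processor = {
    "functional_units": {
        "alu": {"count": 2, "ops": {"add": 1, "sub": 1, "mul": 3}},
        "mem": {"count": 1, "ops": {"load": 2, "store": 1}},
    },
    "registers": 32,
    "write_buses": 2,
    "memory_ports": 1,
    "branch_delay": 3,
    "concurrent_load_store": False,
}

def operation_delay(proc, unit, op, unit_delay=1):
    """Final operation delay = (delay of functional unit) * (delay of op)."""
    return unit_delay * proc["functional_units"][unit]["ops"][op]

print(operation_delay(processor, "alu", "mul"))  # 3
```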

  34. Architecture exploration engine • Chooses optimal parameter values – a constrained optimization problem • Sum of all area_functions <= area_budget • If parameters were independent, pred_runtime = product of runtime_function for every parameter • Since they are not, pred_runtime = (product of runtime_function for every parameter) / dependence_constant(p1, …, pn) where dependence_constant is …
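
The constrained search can be sketched as below. This is a hedged toy version: the parameter names, the runtime/area curves, and the dependence constant are all invented for illustration, and the exhaustive enumeration stands in for whatever search strategy the engine actually uses.

```python
from itertools import product

# Per-parameter curves: candidate value -> (runtime_function, area_function).
curves = {
    "mac":       {0: (1.00, 0), 1: (0.80, 500)},
    "mem_ports": {1: (1.00, 0), 2: (0.85, 800)},
}

def dependence_constant(config):
    """Heuristic correction for parameter interaction (1.0 = independent)."""
    return 1.05 if config == {"mac": 1, "mem_ports": 2} else 1.0

def explore(curves, area_budget):
    """Return (predicted runtime, config) minimizing runtime within budget."""
    best = None
    names = list(curves)
    for values in product(*(curves[n] for n in names)):
        config = dict(zip(names, values))
        area = sum(curves[n][v][1] for n, v in config.items())
        if area > area_budget:
            continue  # violates sum(area_functions) <= area_budget
        runtime = 1.0
        for n, v in config.items():
            runtime *= curves[n][v][0]
        runtime /= dependence_constant(config)
        if best is None or runtime < best[0]:
            best = (runtime, config)
    return best

print(explore(curves, area_budget=1000))  # budget excludes enabling both
```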

  37. Interdependence of parameters • dependence_constant is a heuristic, one per combination of parameters, that adjusts the predicted gain for that combination • obtained by a one-time, exhaustive simulation of standard benchmarks for that combination of parameters • Dependence_constant(p1,…,pn) • = 1 when all pi = basei • = 1 when a single pj != basej and pi = basei for all i != j • = (product of all runtime_functions) / (actual runtime for that combination) otherwise
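
The calibration step on this slide reduces to a single ratio; a minimal sketch, with the benchmark numbers below invented:

```python
# dependence_constant for a parameter combination: the product of the
# individually measured runtime_function values, divided by the runtime
# ratio actually measured when the parameters are combined in simulation.

def dependence_constant(runtime_functions, actual_runtime_ratio):
    """product(runtime_function_i) / actual_runtime for the combination."""
    prod = 1.0
    for r in runtime_functions:
        prod *= r
    return prod / actual_runtime_ratio

# Individually, MAC gives 0.80x runtime and dual-ported memory 0.85x;
# combined, a calibration simulation measured 0.70x - slightly less gain
# than the 0.68x the independent product would predict.
d = dependence_constant([0.80, 0.85], 0.70)
print(round(d, 3))  # 0.971
```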

  38. Evaluated parameters • On the Philips TriMedia VLIW processor • Presence or absence of a MAC unit • HW/SW floating point • Single- or dual-ported memory for parallel memory operations • Pipelined or non-pipelined memory unit

  39. Other customizable parameters • Register file size • Number of architectural clusters • Number and nature of functional units • Presence of an address generation unit • Optimized special operations • Multi-operation patterns • Memory data packing/unpacking support • Memory addressing support • Control-flow support • Loop-level optimizations • Loop-level optimized patterns • Loop vectorization • Architecture-independent optimization

  40. For DSP applications • Functional unit composition • Ignore: cache misses, branch mis-predictions, separation of register files (or functional unit banks), register allocation conflicts • Register casting, if data-dependency interlocks exist in the architecture

  41. Performance gain from INDIVIDUAL parameters • Runtime_function for each benchmark • Application for each of the chosen parameters – MAC, FPU, dual-ported memory, pipelined memory. Figure from Gupta et al.

  42. Performance gain from COMBINED parameters • Runtime_function for each benchmark • Application for selected combinations of the chosen parameters. Figure from Gupta et al.

  43. Dependence constants for the combinations Figure from Gupta et al.

  44. (DSP) FFT benchmark Figure from Gupta et al.

  45. Results • Performance estimation error within 2.5% • Recommended configuration identical to the hand-optimized one

  46. Profile & use app parameters to eliminate processor or processor configuration

  47. App parameters & relevant processor features • Average block size → (acceptable) branch penalty • # of multiply-accumulate operations → MAC • Ratio of address-computation to data-computation instructions → separate address generation ALU • Ratio of I/O instructions to total instructions → memory bandwidth requirements • Average arc length in the data flow graph → total # of registers • Unconstrained ASAP scheduler results → operation concurrency and lower bound on performance • Assumptions • In the average-block-size module, instructions associated with condition-code evaluation of conditional structures and loops are ignored • Each array instruction contributes twice the # of its dimensions to the total • Array accesses are assumed to point to data in memory…
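
Two of the application parameters listed above can be computed directly from an instruction listing; a toy sketch, with the block contents and opcode names invented:

```python
# Average block size and the MAC ratio, computed from a toy basic-block
# listing of an application.

blocks = [
    ["load", "mul", "add", "store"],  # basic block 0
    ["load", "mac", "mac", "store"],  # basic block 1
    ["add", "branch"],                # basic block 2
]

instructions = [op for block in blocks for op in block]
avg_block_size = len(instructions) / len(blocks)      # -> average block size
mac_ratio = instructions.count("mac") / len(instructions)  # -> need for MAC

print(avg_block_size)       # 10 / 3
print(round(mac_ratio, 2))  # 2 / 10 = 0.2
```

A high MAC ratio would argue for keeping the MAC unit; a small average block size would argue for tolerating a larger branch penalty being unacceptable, and so on down the table.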

  48. Related work • Related work evaluates exhaustively or in isolation; no cost-area analysis • Commercial soft cores • User optimizes the instruction set, addressing modes & sizes of internal memory banks; tool estimates area • Gong et al. • Performance analyzer evaluates machine parallelism, number of buses & connectivity, memory ports; does not account for dependency • Ghazal et al. • Predict runtime for advanced processor features & compiler optimizations such as optimized special operations, memory addressing support, control-flow support & loop-level optimization support • Gupta et al. • Analyze the application to select a processor; no quantification of features; performance estimation through exhaustive simulation • Kuulusa et al., Herbert et al., Shackleford et al. • Tools for architecture exploration by exhaustive search; evaluate instruction extensions • Custom-fit processors • Also exhaustive search but targets a VLIW architecture – changeable memory sizes, register sizes, kinds and latencies of functional units, and clustered machines; speedup/cost graphs are derived for all combinations, yielding Pareto points

  49. Other related papers • Kuulusa et al., Herbert et al., Shackleford et al. evaluate extensions to the instruction set • Managing multi-configuration hardware via dynamic working set analysis • by Dhodapkar & Smith, University of Wisconsin • Reconfigurable custom computing as a supercomputer replacement • by Milne, University of South Australia

  50. Discussion
