Exploiting Operation Level Parallelism through Dynamically Reconfigurable Datapaths

Exploiting Operation Level Parallelism through Dynamically Reconfigurable Datapaths • Zhining Huang and Sharad Malik • DAC 2002

Outline • Introduction • Methodology overview • Datapaths for kernel loops • Reconfigurable Datapath • Benchmark studies • Conclusions and future work

Introduction • Programmable platforms • Bit-level programmable • Coarse-grained FPGA • Word-level programmable • Dynamically reconfigurable co-processor • Fixed hardware blocks • Programmable interconnect • Accelerate loops

Methodology overview • Master processor • Reconfigurable co-processor • Reconfigurable datapath • ASIC-like function units • Reconfigurable interconnections • Control logic • State machine • Control datapath execution

Datapaths for kernel loops • Extract kernel loops • Direct mapping of kernel loop datapath • Branch condition transforms • Pipelining the execution • Estimation of the pipeline execution time

Extract kernel loops • Use IMPACT compiler • Profiling • Loop detection (only inner loops) • Data dependence analysis • Register live-in/out • Data dependence between instructions • Within a loop • In different loops

Direct mapping of kernel loop datapath

Branch condition transforms • Into different datapaths • Selected by multiplexer
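
As a rough illustration of the transform above, the sketch below evaluates both sides of a branch as separate datapaths and lets a multiplexer choose the result. The function names and the example arithmetic are hypothetical and are not taken from the slides.

```python
def branch_version(x: int, threshold: int) -> int:
    """Original control flow: only one side of the branch executes."""
    if x > threshold:
        return x - threshold
    return x + 1


def mux(select: bool, when_true: int, when_false: int) -> int:
    """Two-input multiplexer: forwards one of two already-computed values."""
    return when_true if select else when_false


def datapath_version(x: int, threshold: int) -> int:
    """Both branch bodies become datapaths; the condition drives the mux."""
    taken = x - threshold        # datapath for the taken side
    not_taken = x + 1            # datapath for the fall-through side
    condition = x > threshold    # branch condition computed as a data value
    return mux(condition, taken, not_taken)


# Both forms agree for every input tried here.
results = [(branch_version(x, 2), datapath_version(x, 2)) for x in range(-5, 6)]
```

Because both sides are always computed, this form trades extra function units for the removal of control flow inside the loop body.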

Pipelining the execution • Insert registers in the datapath • Data dependence • Delay or bypass • From registers or memory operations • Four data dependence cases and solutions • Tstore < Tload : delay store • Tstore1 < Tstore2 : eliminate store1 • Two loads : do nothing • Tload < Tstore : delay or bypass

Estimation of the pipeline execution time • T=[S+D*(N-1)]+O+W cycles • S: total number of pipeline stages of a datapath • D: delay of the consecutive loop iteration • N: loop iteration number • O: switch overhead • W: write back cycles
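
The estimate above can be evaluated directly. The sketch below is a minimal helper, assuming all quantities are whole cycle counts; the function name, parameter names, and the example numbers are hypothetical.

```python
def pipeline_cycles(stages: int, iteration_delay: int, iterations: int,
                    switch_overhead: int, write_back: int) -> int:
    """Evaluate T = [S + D*(N-1)] + O + W, in cycles.

    stages          -- S, total pipeline stages of the datapath
    iteration_delay -- D, delay between consecutive loop iterations
    iterations      -- N, number of loop iterations
    switch_overhead -- O, cycles spent switching to and from the datapath
    write_back      -- W, cycles to write results back
    """
    if iterations < 1:
        raise ValueError("the loop must run at least one iteration")
    return (stages + iteration_delay * (iterations - 1)) + switch_overhead + write_back


# Made-up example: 6 stages, a new iteration every 2 cycles, 100 iterations.
total = pipeline_cycles(stages=6, iteration_delay=2, iterations=100,
                        switch_overhead=10, write_back=4)
print(total)  # 6 + 2*99 + 10 + 4 = 218
```

For large N the D*(N-1) term dominates, so the delay between consecutive iterations matters more than the pipeline depth.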

Reconfigurable Datapath • Datapath mapping • Routing box • Critical path and clock speed • Reconfiguration overhead

Datapath mapping

Routing box

Critical path and clock speed • T(routing box) + T(function unit) + T(wire delay) • Sophisticated control and function units • Two or more cycles for longer delays
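
One way to read the slide above is that the clock period is set by the sum of the routing box, function unit, and wire delays, and that a slower unit occupies more than one cycle. The sketch below works through that reading; the multi-cycle policy, the function names, and the picosecond figures are assumptions made for illustration.

```python
def clock_period_ps(t_routing_box_ps: int, t_function_unit_ps: int,
                    t_wire_ps: int) -> int:
    """Clock period taken as routing box + function unit + wire delay."""
    return t_routing_box_ps + t_function_unit_ps + t_wire_ps


def cycles_for_path(path_delay_ps: int, period_ps: int) -> int:
    """Whole cycles needed for a path (assumed policy: round up, at least 1)."""
    if period_ps <= 0:
        raise ValueError("clock period must be positive")
    # Integer ceiling division avoids floating-point rounding.
    return max(1, (path_delay_ps + period_ps - 1) // period_ps)


# Made-up delays: a typical unit sets the period, a slower unit spans several cycles.
period = clock_period_ps(300, 1500, 200)            # 2000 ps
typical_unit_cycles = cycles_for_path(2000, period)  # fits in one cycle
slow_unit_cycles = cycles_for_path(4500, period)     # needs three cycles
```

Under this reading, the clock can be set for the common function units while the few slow ones are treated as multi-cycle operations.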

Reconfiguration overhead • Overhead • Reconfiguration context switch • Execution switching to the datapath • Execution switching back • Number of loops selected • 8 or 16 • Control bits are stored in distributed co-processor • Register file bandwidth in master processor

Benchmark studies

Conclusions and future work • Methodology • Dynamically reconfigurable datapath • For a specific application • The co-processor can be viewed as a VLIW • Loop restructuring techniques • Reduce data dependencies
