
A Programmable Processing Array Architecture Supporting Dynamic Task Scheduling and Module-Level Prefetching

A Programmable Processing Array Architecture Supporting Dynamic Task Scheduling and Module-Level Prefetching. Junghee Lee*, Hyung Gyu Lee*, Soonhoi Ha†, Jongman Kim*, and Chrysostomos Nicopoulos‡. Presented by Junghee Lee.






  1. A Programmable Processing Array Architecture Supporting Dynamic Task Scheduling and Module-Level Prefetching. Junghee Lee*, Hyung Gyu Lee*, Soonhoi Ha†, Jongman Kim*, and Chrysostomos Nicopoulos‡. Presented by Junghee Lee.

  2. Introduction
  • The design space spans single core, multi-core, many-core, and massively parallel processing arrays (MPPAs)
  • Fusion: powerful cores + H/W accelerator in a single die. Ex) AMD Fusion
  • Programmable hardware accelerator. Ex) GPGPU

  3. MPPA as Hardware Accelerator
  • Host CPUs communicate with the MPPA through a host CPU interface; the MPPA is an array of identical core tiles with access to device memory and I/O
  • Challenges: expressiveness, debugging, and memory hierarchy design

  4. Related Works (expressiveness / debugging / memory)
  • GPGPU, AMD Fusion: SIMD / multiple debuggers, event graph / scratch-pad memory, cache
  • Tilera: multi-threading / multiple debuggers / coherent cache
  • Rigel: multi-threading / not addressed / software-managed cache
  • Ambric: Kahn process network / formal model / scratch-pad memory
  • Proposed MPPA: event-driven model / inter-module and intra-module debug / scratch-pad memory with prefetching

  5. Contents • Introduction • Execution Model • Hardware Architecture • Evaluation • Conclusion

  6. Execution Model
  • Specification
  • Module = (b, Pi, Po, C, F), where b = behavior of the module, Pi = input ports, Po = output ports, C = sensitivity list, F = prefetch list
  • Signal: Net = (d, K), where d = driver port and K = a set of sink ports
  • Semantics
  • A module is triggered when any signal connected to C changes
  • Function calls and memory accesses are limited to within a module
  • Non-blocking writes and blocking reads
  • The specification can be modified at run-time
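The module and signal tuples above can be rendered as plain data structures. The following is an illustrative Python sketch (the class and function names are my own, not from the paper); the trigger rule is the one stated in the semantics: a module fires when any signal on its sensitivity list changes.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set

# Sketch of the slide's tuples: Module = (b, Pi, Po, C, F), Net = (d, K).
@dataclass
class Module:
    b: Callable[["Module"], None]  # behavior, invoked when the module is triggered
    Pi: Set[str]                   # input ports
    Po: Set[str]                   # output ports
    C: Set[str]                    # sensitivity list (subset of Pi)
    F: Set[str]                    # prefetch list (fetched before b runs)

@dataclass
class Net:
    d: str                                      # driver port
    K: Set[str] = field(default_factory=set)    # sink ports

def triggered_modules(modules: List[Module], changed_port: str) -> List[Module]:
    """A module is triggered when any signal connected to C changes."""
    return [m for m in modules if changed_port in m.C]
```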

  7. Example
  • Quick sort
  • A pivot is selected
  • The given array is partitioned so that the left segment contains elements smaller than the pivot and the right segment contains elements larger than the pivot
  • The left and right segments are recursively partitioned
  • Specifying quick sort
  • Multi-threading: works, but hard to debug
  • SIMD: inefficient due to input dependency
  • Kahn process network: impossible due to the dynamic nature of the algorithm

  8. Specify Quick Sort with Event-driven Model
  • Partition module
  • b (behavior): select a pivot, partition the input array, instantiate another partition module if necessary
  • Pi (input port): input array and its position
  • Po (output port): left and right segments and their positions
  • C (sensitivity list): input array
  • F (prefetch list): input array
  • Collection module
  • b (behavior): collect segments
  • Pi (input port): sorted segments and intermediate result
  • Po (output port): final result and intermediate result
  • C (sensitivity list): sorted segments
  • F (prefetch list): sorted segments and intermediate result
  • Dataflow: input array → chain of Partition modules → Collection module → final result, with the intermediate result fed back into Collection
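As a software analogy (not the hardware implementation), the Partition/Collection decomposition can be emulated with a queue of segment events: enqueuing a segment stands in for instantiating another Partition module, and the loop plays the role of the event-driven kernel. A minimal sketch, with assumed function names:

```python
from collections import deque

def event_driven_quicksort(data):
    """Emulate slide 8 in software: each (start, end) event is one
    Partition module firing; in-place collection stands in for the
    Collection module."""
    result = list(data)
    ready = deque([(0, len(result))])       # pending Partition "modules"
    while ready:                            # event-driven kernel loop
        start, end = ready.popleft()
        if end - start <= 1:
            continue                        # segment already sorted
        pivot = result[end - 1]             # b: select a pivot
        i = start
        for j in range(start, end - 1):     # b: partition around the pivot
            if result[j] < pivot:
                result[i], result[j] = result[j], result[i]
                i += 1
        result[i], result[end - 1] = result[end - 1], result[i]
        ready.append((start, i))            # instantiate left Partition module
        ready.append((i + 1, end))          # instantiate right Partition module
    return result
```

Unlike a Kahn process network, the set of "modules" here grows at run-time, which is exactly the dynamic behavior the event-driven model is meant to express.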

  9. Contents • Introduction • Execution Model • Hardware Architecture • Evaluation • Conclusion

  10. MPPA Microarchitecture
  • Identical core tiles: each consists of a uCPU, scratch-pad memory, and peripherals that support the execution model; one core tile is designated as the execution engine
  • Execution engine software running on a core tile: consists of a scheduler, signal storage, and an interconnect directory; supports the execution model; if necessary, it is split into multiple instances running on different core tiles
  [Diagram: a grid of core tiles connected to the host CPU interface and device memory; one tile (E) hosts the execution engine.]

  11. Core Tile Architecture
  • uCPU: generic small processor, treated as a black box
  • Scratch-pad memory (SPM): software-managed on-chip SRAM; double buffering, where one buffer holds the current module and the other the next module to be prefetched
  • Prefetcher: prefetches the code and data of the next module while the current module is running on the uCPU
  • Context manager: stores information about the modules and switches the context when the current module finishes and the next module is ready
  • Input signal queue: stores the input data; the actual data resides in the SPM while its information is managed by this queue
  • Output signal queue: stores the output data and notifies the interconnect directory when an output is updated
  • Message handler: counterpart of the prefetcher; sends data to the requester and handles the system messages
  • Message queue: buffers system messages
  • Network interface: NoC router

  12. Execution Engine
  • Most of its functionality is implemented in software, while the hardware facilitates communication; the software implementation gives flexibility in the number and location of execution engines
  • One way to visualize our MPPA is to regard the execution engine as an event-driven simulation kernel
  • The execution engine interacts with modules running on other core tiles through messages

  13. Components of Execution Engine
  • Scheduler: keeps track of the status and location of modules; maintains three queues: wait, ready, and run
  • Signal storage: stores signal values in the device memory; if a signal is updated but its value is still stored in a node, the signal storage invalidates its copy and records the location of the latest value
  • Interconnect directory: keeps track of the connectivity of signals and ports; maintains the sensitivity lists
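The three-queue scheduler can be sketched in a few lines. This is an assumed rendering (modules are plain dicts with their sensitivity list under key "C"; method names are illustrative, not from the paper): a signal change moves sensitive modules from wait to ready, and dispatch models assigning a ready module to a free core tile.

```python
from collections import deque

class Scheduler:
    """Sketch of the execution engine's wait/ready/run queues."""

    def __init__(self):
        self.wait = []           # modules waiting for a sensitivity-list event
        self.ready = deque()     # triggered modules awaiting a free core tile
        self.run = []            # modules currently executing on core tiles

    def add(self, module):
        self.wait.append(module)

    def on_signal(self, port):
        # A signal changed: move every module sensitive to `port` to ready.
        fired = [m for m in self.wait if port in m["C"]]
        for m in fired:
            self.wait.remove(m)
            self.ready.append(m)

    def dispatch(self):
        # Assign the next ready module to a core tile, if any is ready.
        if self.ready:
            m = self.ready.popleft()
            self.run.append(m)
            return m
        return None
```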

  14. Module-Level Prefetching
  • Hides the overhead of the dynamic scheduling
  • Prefetches the next module while the current module is running
  [Timeline diagram: while the uCPU executes a module, the prefetcher exchanges messages with the scheduler, interconnect directory, signal storage, and other nodes, overlapping the next module's memory accesses with execution.]
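A back-of-the-envelope cost model (assumed cycle counts, not measured data from the paper) shows why the overlap hides latency: with double buffering, fetching module i+1 proceeds during module i's execution, so the uCPU stalls only when a fetch outlasts the execution it overlaps.

```python
def total_cycles(exec_cycles, fetch_cycles, prefetch=True):
    """Toy model: per-module execution and fetch latencies, with or
    without overlapping the next fetch behind the current execution."""
    if not prefetch:
        # Serialized: every module waits for its own fetch.
        return sum(f + e for e, f in zip(exec_cycles, fetch_cycles))
    total = fetch_cycles[0]                  # first module is fetched up front
    for i, e in enumerate(exec_cycles):
        nxt = fetch_cycles[i + 1] if i + 1 < len(fetch_cycles) else 0
        total += max(e, nxt)                 # next fetch hides behind execution
    return total
```

For example, two modules that each execute for 10 cycles and fetch in 4 cycles cost 28 cycles serialized but 24 with prefetching, since only the first fetch is exposed.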

  15. Illustrative Example
  [Diagram: three core tiles, each showing its uCPU, prefetcher, output signal queue, and message handler, executing Partition modules (Partition 0 through Partition 5); the execution engine tile holds the interconnect directory, the signal storage, and the scheduler, whose wait, ready, and run queues contain Collection modules.]

  16. Contents • Introduction • Execution Model • Hardware Architecture • Evaluation • Conclusion

  17. Benchmark
  • Recognition, Synthesis and Mining (RMS) benchmark
  • Fine-grained parallelism: dominated by short tasks with a small memory footprint, so run-time scheduling overhead is high
  • Task-level parallelism: exhibits dependencies, which are hard to implement on a GPGPU

  18. Simulator • In-house cycle-level simulator • Parameters

  19. Utilization
  [Bar chart: core utilization (0 to 1.0) for benchmarks OP, FS, BS, CF, CED, BT, and QS, with and without prefetching.]

  20. Scalability
  [Plot: core utilization (0 to 1.0) and execution time (8,000 to 20,000 cycles) versus number of core tiles (24 to 64), for configurations (1) and (3).]

  21. Conclusion
  • This paper proposes a novel MPPA architecture that employs an event-driven execution model
  • Handles dependencies through dynamic scheduling
  • Hides the dynamic scheduling overhead through module-level prefetching
  • Future work
  • Support applications that require a larger memory footprint
  • Adjust the number of execution engines dynamically
  • Support inter-module debugging

  22. Questions? Contact info Junghee Lee junghee.lee@gatech.edu Electrical and Computer Engineering Georgia Institute of Technology

  23. Thank you!
