

DEFACTO: Combining Parallelizing Compiler Technology with Hardware Behavioral Synthesis*. Pedro C. Diniz, Mary W. Hall, Joonseok Park, Byoungro So and Heidi Ziegler. University of Southern California / Information Sciences Institute 4676 Admiralty Way, Suite 1001





Presentation Transcript


  1. DEFACTO: Combining Parallelizing Compiler Technology with Hardware Behavioral Synthesis* Pedro C. Diniz, Mary W. Hall, Joonseok Park, Byoungro So and Heidi Ziegler University of Southern California / Information Sciences Institute 4676 Admiralty Way, Suite 1001 Marina del Rey, California 90292 * The DEFACTO project was funded by the Information Technology Office (ITO) of the Defense Advanced Research Projects Agency (DARPA) under contract #F30602-98-2-0113.

  2. Outline • Background & Motivation • Part 1: Application Mapping Example • Part 2: Design Space Exploration • Part 3: Challenges for Future FPGAs • Related Work • Conclusion

  3. DEFACTO Objective & Goals • Objectives: • Automatically Map High-Level Applications to Field-Programmable Hardware (FPGAs) • Explore Multiple Design Choices • Goal: • Make Reconfigurable Technology Accessible to the Average Programmer

  4. What Are FPGAs: • Key Concepts • Configurable Hardware • Reprogrammable (ms latency) • Architecture • Configurable Logic Blocks • “Universal” logic • Some input/outputs latched • Passive network between CLBs • Memories, processor cores

  5. Why Use FPGAs? • Advantages over Application-Specific Integrated Circuits (ASICs) • Faster Time to Market • “Post silicon” Modification Possible • Reconfigurable, Possibly Even at Run-time • Advantages Over General-Purpose Processors • Application-Specific Customization (e.g., parallelism, small data-widths, arithmetic, bandwidth) • Disadvantages • Slow (typical automatic design @25MHz) • Low Density of Transistors

  6. How to Program FPGAs? • Hardware-Oriented Languages • VHDL or Verilog • Very Low-Level Programming • Commercial Tools (e.g., MonetTM) • Choose Implementation Based on User Constraints • Time and Space Trade-Off • Provide Estimations for Implementation • Problem: Too Slow for Large Complex Designs • Place-and-Route Can Take up to 8 Hours for Large Designs • Unclear What to Do When Things Go Wrong

  7. Behavioral Synthesis Example

    variable A : std_logic_vector(7 downto 0);
    ...
    X <= (A * B) - (C * D) + F;

  Alternative implementations: • 13 Registers, 1 Multiplier, 2 Adders/Subtractors: 3 (shorter) clock cycles • 9 Registers, 2 Multipliers, 2 Adders/Subtractors: 2 (shorter) clock cycles • 6 Registers, 2 Multipliers, 2 Adders/Subtractors: 1 (long) clock cycle

  8. Synthesizing FPGA Designs: Status • Technology Advances have led to Increasingly Large Parts • FPGAs now have Millions of “gates” • Current Practice is to Handcode Designs for FPGAs in Structural VHDL • Tedious and Error Prone • Requires Weeks to Months Even for Fairly Simple Designs • Higher-level Approach Needed!

  9. DEFACTO: Key Ideas • Parallelizing Compiler Technology • Complements Behavioral Synthesis • Adjusts Parallelism and Data Reuse • Optimizes External Memory Accesses • Design Space Exploration • Evaluates and Compares Designs before Committing to Hardware • Improves Design Time Efficiency • a form of Feedback-directed Optimization

  10. Opportunities: Parallelism & Storage
  Behavioral Synthesis Optimizations: • Scalar Variables Only, inside the Loop Body • Supports User-Controlled Loop Unrolling • Manages Registers and Inter-Operator Communication • Considers One FPGA • Performs Allocation, Binding & Scheduling of Hardware
  Parallelizing Compiler Optimizations: • Scalars & Multi-Dimensional Arrays, inside the Loop Body & Across Iterations • Analysis Guides Automatic Loop Transformations • Evaluates Tradeoffs of Different Memories, On- and Off-Chip • System-Level View • No Knowledge of Hardware Implementation

  11. Part 1: Mapping Complete Designs from C to FPGAs Sobel Edge Detection Example

  12. Example - Sobel Edge Detection

    char img[IMAGE_SIZE][IMAGE_SIZE], edge[IMAGE_SIZE][IMAGE_SIZE];
    int uh1, uh2, threshold;
    for (i = 0; i < IMAGE_SIZE - 4; i++) {
      for (j = 0; j < IMAGE_SIZE - 4; j++) {
        uh1 = ((-img[i][j]) + (-(2 * img[i+1][j])) + (-img[i+2][j]))
            + ((img[i][j-2]) + (2 * img[i+1][j-2]) + (img[i+2][j-2]));
        uh2 = ((-img[i][j]) + (img[i+2][j])) + (-(2 * img[i][j-1]))
            + (2 * img[i+2][j-1]) + ((-img[i][j-2]) + (img[i][j-2]));
        if ((abs(uh1) + abs(uh2)) < threshold)
          edge[i][j] = 0xFF;
        else
          edge[i][j] = 0x00;
      }
    }

  The two 3x3 Sobel convolution masks from the slide figure: [1 0 -1; 2 0 -2; 1 0 -1] and [-1 -2 -1; 0 0 0; 1 2 1].

  13. Sobel - A Naïve Implementation (Datapath figure: the img[i..i+2][j..j+2] reads feed the arithmetic tree that selects 0x00 or 0xFF for edge[i][j].) • Large Number of Adders and Multipliers (Shifts in This Case) • Too Many Memory Accesses! • 8 Reads and 1 Write per Iteration of the Loop • Observation: Across 2 Iterations, 4 out of 8 Values Can Be Reused

  14. Data Reuse Analysis - Sobel (Figure: reuse distances among the eight img accesses, e.g. d = (1,0) between img[i][j] and img[i+1][j], d = (2,0) between img[i][j] and img[i+2][j], d = (0,1) between img[i][j] and img[i][j+1], and d = (0,2) between img[i][j] and img[i][j+2].)

  15. Data Reuse Using Tapped-Delay Lines • Reduce the Number of Memory Accesses • Exploit Array Layout and Distribution • Packing • Stripping • Example: four separate accesses cost 1.0 + 1.0 + 1.0 + 1.0 = 4.0 accesses, while four values packed into one word cost 0.25 + 0.25 + 0.25 + 0.25 = 1.0 access.

  16. Overall Design Approach • Application Data-Paths • Extract Body of Loops • Uses Behavioral Synthesis • Memory Interfaces • Uses Data Access Patterns to Generate Channel Specs • VHDL Library Templates (Figure: application data-paths connected to external memories through the generated memory interfaces.)

  17. WildStar™: A Complex Memory Hierarchy (Board diagram: FPGA 0, FPGA 1, and FPGA 2; SRAM0-SRAM3; Shared Memory0-Shared Memory3; a PCI controller; 32-bit and 64-bit paths to off-board memory.)

  18. Project Status • Complex Infrastructure • Different Programming Languages (C vs. VHDL) • Different EDA Tools • Different Vendors • Experimental Target • In-House Tools • Combines Compiler Techniques and Behavioral Synthesis • Different Execution Models • Reconcile Representation • It Works! • Fully Automated for Single-FPGA Designs • Modest Manual Intervention for Multi-FPGA Designs (Simulation OK)
  Tool flow (from the slide figure): Algorithm Description → Compiler Analysis → Design Space Exploration → Code Transformations and Annotations → SUIF2VHDL → Computation & Data Partitioning → Behavioral Synthesis & Estimation (Monet) → Memory Access Protocols → Logic Synthesis (Synplicity) → Place & Route (Xilinx Foundation) → Annapolis WildStar Board

  19. Sobel on the Annapolis WildStar Board (Figure: input image and output edge-detected images, manual vs. automated implementations.)

  20. Part 2: Design Space Exploration Using Behavioral Synthesis Estimates

  21. Design Space Exploration (Current Practice) • Start from a Design Specification in Low-Level VHDL • Iterate: Logic Synthesis / Place & Route → Validation / Evaluation → "Correct? Good Design?" → Design Modification • 2 Weeks for a Working Design • 2 Months for an Optimized Design

  22. Design Space Exploration (Our Approach) • Start from the Algorithm (C/Fortran) • Compiler Optimizations (SUIF): Unroll and Jam, Scalar Replacement, Custom Data Layout • Flow: Unroll Factor Selection → SUIF2VHDL Translation → Behavioral Synthesis Estimation → Logic Synthesis / Place & Route • Overall, Less than 2 Hours • 5 Minutes for Optimized Design Selection

  23. Problem Statement • Trade-Off: Exploiting Parallelism and Reusing Data on Chip Lowers Execution Time but Raises Space Requirements (More Copies of Operators, More On-Chip Registers) • Constraint: Size of Design Less than FPGA Capacity • Goal: Minimal Execution Time • Selection Criteria: For Given Performance, Minimal Space • Frees up More Space for Other Computations • Better Clock Rate Achieved • Desirable to Use On-Chip Space Efficiently

  24. Balance • Definition: Balance = Data Fetch Rate / Data Consumption Rate • Consumption Rate [bits/cycle] = Data Bits Consumed per Computation Time • Limited by the Data Dependences of the Computation • Data Fetch Rate [bits/cycle] = Data Bits Required per Computation Time • Limited by the FPGA's Effective Memory Bandwidth • If Balance > 1, Compute Bound • If Balance < 1, Memory Bound • Balance Suggests Whether More Resources Should Be Devoted to Enhancing Computation or Storage.

  25. Loop Unrolling • Exposes Fine-Grain Parallelism by Replicating the Loop Body:

    DO I = 1, N
      A(I) = A(I-2) + B(I)

  unrolled by a factor of 2 becomes

    DO I = 1, N, 2
      A(I) = A(I-2) + B(I)
      A(I+1) = A(I-1) + B(I+1)

  (Figure: dataflow graphs for the original and unrolled loop bodies.) • As the Unrolling Factor Increases, Both the Data Fetch and Consumption Rates Increase.

  26. Monotonicity Properties (Figure: data fetch rate, data consumption rate, and balance (= fetch/consumption) plotted against the unroll factor, each marked with the saturation point.) • Saturation Point: the Unroll Factor that Saturates Memory Bandwidth for a Given Architecture

  27. Balance & Optimal Unroll Factor (Figure: data fetch rate and data consumption rate, in bits/cycle, plotted for unroll factors from 1 to the maximum; the optimal solution sits at the saturation point.) • Balance Guides the Design Space Exploration.

  28. Experiments • Multimedia Kernels • FIR (Finite Impulse Response) • Matrix Multiply • Sobel (Edge Detection) • Pattern Matching • Jacobi (Five Point Stencil) • Methodology • Compiler Translates C to SUIF and Behavioral VHDL • Synthesis Tool Estimates Space and Computational Latency • Compiler Computes Balance and Execution Time Accounting for Memory Latency • Memory Latency • Pipelined: 1 cycle for read and write • Non-pipelined: 7 cycles for read and 3 cycles for write

  29. FIR (Plot: candidate designs for outer-loop unroll factors 1, 2, 4, 8, 16, 32, and 64, with the selected design marked.) • Selected Design Speedup: 17.26

  30. Matrix Multiply (Plot: candidate designs for outer-loop unroll factors 1, 2, 4, 8, 16, and 32, with the selected design marked.) • Selected Design Speedup: 13.36

  31. Efficiency of Design Space Exploration • On Average, Only 0.3% (15%) of the Design Space Searched.

  32. FIR: Estimation vs. Accurate Data • Larger Designs Lead to Degradation in Clock Rates • Compiler Can Use a Statistical Approach to Derive Confidence Intervals for Space • Our case: Compiler Makes Correct Decision using Imperfect Data

  33. Part 3: Challenges for Future FPGAs Heterogeneous Functional and Storage Resources Data/Computation Partitioning and Scheduling Revisited

  34. Field-Programmable Core Arrays (Figure: a device integrating IP cores, e.g. an ARM processor and a DSP, with S-RAM and D-RAM blocks.) • Large Number of Transistors • Multiple Application-Specific Cores • Customization of Interconnect • Other Specialized Logic • Challenges: • Data Partitioning: Custom Storage Structures; Allocation, Binding and Scheduling; Replication and Reorganization • Computation Partitioning: Scheduling between Cores; Coarse-Grain Pipelining • Revisiting Issues with Parallelizing Compiler Technology

  35. Related Work • Compilers for Special-purpose Configurable Architectures • PipeRench (CMU), RaPiD (UW), RAW (MIT) • High-level Languages Oriented towards Hardware • Handel-C, Cameron(CSU), PICO(HP), Napa-C (LANL) • Integrated Compiler and Logic Synthesis • Babb (MIT), Nimble (Synopsys) • Compiling from MatLab to FPGAs • Match compiler (Northwestern)

  36. Conclusion • Combines Behavioral Synthesis and Parallelizing Compiler Technologies • Fast & Automated Design Space Exploration • Trades Space with Functional Units via Loop Unrolling • Uses Balance and Monotonicity Properties • Searches Only 0.3% of the Entire Design Space • Near-Optimal Performance and Smallest Space • Future FPGAs • Coarser-Grained, Custom Functional and Storage Structures • Multiprocessor on a Chip • Data and Computation Partitioning and Coarse-Grain Scheduling

  37. Thank You
