Generation of Custom DSP Transform IP Cores: Case Study Walsh-Hadamard Transform

Generation of Custom DSP Transform IP Cores: Case StudyWalsh-Hadamard Transform Fang Fang James C. Hoe Markus Püschel Smarahara Misra Carnegie Mellon University

? • May not match the application’s needs: • parameters, speed, power, area and their trade-off. Conventional Approach:Static IP Cores • IP cores improve productivity and reduce time-to-market. • e.g. Xilinx LogiCore library: FFT for N=16, 64, 256 and 1024 on 16-bit complex numbers chip library application

Alternative Approach: IP Core Generation • Generate IP cores to match specific application requirements (speed, area, power, numerical accuracy, and I/O bandwidth…) Application parameters Generator+Evaluator Speed / area / power requirements Optimized IP cores

Design Space Designer’s Focus Design space • DSP transform design can be studied at several levels. • More math knowledge involved  Bigger design space to explore. Math Algorithm Architecture Gate Circuit

Problem • Problem: gap between transform mathematics and hardware design A hardware engineer A math guy What I know: Finite state machinePipeliningSystolic array … What I know: Linear algebraDigital signal processingAdaptive filter theory …

Formula example RepresentationManipulationMapping Formula Bridge: Formula • Solution: - Formula representation of DSP transforms - Automated formula manipulation and mapping A hardware engineer A math guy What I know: Finite state machinePipeliningSystolic array … What I know: Linear algebraDigital signal processingAdaptive filter theory …

Outline • Introduction • Technical Details(illustrated by WHT transform) • What are the degreesof design freedom? • How do we explore this design space? • Experimental Results • Summary and Future work

n fold Tensor productA  B = [ ak,l• B ], where A = [ak,l] Walsh-Hadamard Transform • Why WHT? • Typical access pattern for a DSP transform • Close to 2-power FFT • Study important construct  • Definition

an F2 block Subtraction Addition WHT23 From Formula to Architecture

Pease Algorithm Stride permutation L2N

0 1 Regular routing 2 3 4 5 Possibility for vertical folding 6 7 Possibility for horizontal folding Pease Algorithm an F2 block 0 1 2 3 4 5 6 7 12 F2 blocks total

2 2 HF + Vertical Folding (VF) Repeat 3 times 1 F2 block Folded L28 Folding 8 Repeat 3 times Horizontal Folding (HF) 4 F2 blocks 8 L28 I4F2 ?

Challenge in Vertical Folding • Straightforward approach: Memory-based reordering • Extra control logic to reorder address • Computation speed is limited by memory speed • Ad-hoc approach: Register routing • Hard to automate the process • Our approach: formula-based matrix factorization How to fold these wires? L2N Folded L28 N ports Q ports

0 0 L2Q has Q input portsQ=2q, N=2n 1 JN can be easily folded [1] 4 2 2 3 6 4 L24 (J4)4 (J64)4 (J32)4 (J16)4 (J8)4 1 5 5 6 3 Example of (L264)4 (N=64, Q = 4) 7 7 7 [1]. J.H.Takala etc., “Multi-Port Interconnection Networks for Radix-R Algorithms”, ICASSP01 Factorization of Stride Permutation J8

Freedom in Horizontal Folding • WHT2n has n horizontal stages in the flattened design • The divisors of n are all the possible folding degrees • Example: HF degrees of WHT26 can be 1, 2, 3, 6 • Effects of more horizontal folding degree • Less pipeline depth •  lower • throughput

Freedom in Vertical Folding • WHT2n has 2n vertical ports in the flattened design • 1, 2, 4… 2n-1 are all possible folding degrees • Example: VF degrees of WHT26 could be 1, 2, 4, … 32 • Effects of more vertical folding degree • Less I/O bandwidth •  longer • computation

Outline • Introduction • Technical Details • Experimental Results • Summary and Future work

Design Space Exploration HF factor(1,2,3,6) VF factor(1,2,4, ... 32) X = 24 different designs Transform size(64) Bit-width (8) WHT Generator Technology Libary Xilinx FPGA Synthesis Performance requirement Evaluator Xilinx FPGA Place&Route

To achieve the same area, multiple folding options are available. Area vs. Folding Degrees

Latency vs. Folding Degrees (WHT64)

Latency is almost unaffected by HF, except comparing flattened design with folded design Latency vs. Folding Degrees (WHT64)

Throughput vs. Folding Degrees Folding always lowers throughput

Our fastest design Design in [2] Comparison with an Existing Design • WHT8 • 8 bit fixed-point • FPGA: Xilinx Virtex xcv1000e-fg680 Speed grade: -8 • Compare our fastest generated designs against results reported by Amira, et al. [2] 60% more area 80% reduction in latency13 times higher throughput Area (#of slices) Latency(ns) Throughput(MOP/s) [2] A.Amira et al., “Novel FPGA Implementations of Walsh-Hardamard Transforms for Signal Processing”, Vision, Image and Signal Processing, IEE Proceedings- , Volume: 148 Issue: 6 , Dec. 2001

Our smallest design Design in [2] Comparison with an Existing Design • WHT8 • 8 bit fixed-point • FPGA: Xilinx Virtex xcv1000e-fg680 Speed grade: -8 • Compare our smallest generated designs against results reported by Amira, et al. [2] Less areaShorter latencyHigher throughput Area (#of slices) Latency(ns) Throughput(MOP/s) [2] A.Amira et al., “Novel FPGA Implementations of Walsh-Hardamard Transforms for Signal Processing”, Vision, Image and Signal Processing, IEE Proceedings- , Volume: 148 Issue: 6 , Dec. 2001

Performance throughput / latency / area … Verticalfolding Horizontal folding Summary • Large performance variations over the design space of horizontal and vertical folding • Automatic design space exploration through formula manipulation and mapping can find the best trade-off

RepresentationManipulationMapping Formula PipeliningSystolic arrayDistributed ArithmeticFix-point vs. Floating-point … DFTDCTDSTDWT … Future work More DSPtransforms More design decisions

Thank you! Contact: Fang Fang Email: ffang@cmu.edu URL: www.ece.cmu.edu/~ffang

Generation of Custom DSP Transform IP Cores: Case Study Walsh-Hadamard Transform