DARPA Sorting Benchmark On SRC Platform

Gang Quan, Allen Michalski, Duncan Buell, James Davis • Department of CSE, University of South Carolina

Outline • Introduction • DARPA Benchmark Suite • SRC Platform • Integer Sorting Algorithm Implementations • Experiments and Results • Discussions

DARPA Benchmark Suite • Six benchmarks to measure the performance of high productivity computing systems • Benchmark suite • Large shared memory random access • Matrix multiplication with multiprecise modular coefficients • Dynamic programming • Data transposition • Integer sort • Bit string pattern matching

DARPA Integer Sorting Benchmark • Sorting a stream of N unsigned integers, each n bits wide • In-place sorting is not required • Non-unique integers • In-core sort • N = 10^6, n = 64 • N = 5*10^7, n = 128 • Secondary memory sort • N = 5*10^7, n = 64

SRC Architecture • Host • 1GB main memory • 512K L2 cache • MAP • 2 XC2V6000 • 6 memory banks (24MB total) • 800MB/s to/from main memory • 4800MB/s to/from on-board memory

Software Platform • System • Linux (Red Hat 7.3) • Driver and Library additions • Compilers • Intel Compilers (C/C++, Fortran), static and run-time libraries • SRC Compilers (C/C++, Fortran) and FPGA Micro • Tools • FPGA • Synplicity Synplify Pro • Xilinx Alliance ISE

SRC Compilation Process

Integer Sorting Implementation • Software only (Proc_only) • FPGA Implementation • Multi-threading

Software Only • Radix Sort • radix_sort(A, radix_size): let |A| be the maximum number of binary bits in each element; for (i = 1; i <= ceil(|A| / radix_size); ++i) bucket_sort(A, on “digit” i); • Priority queue sort
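The radix sort pseudocode above can be sketched in C as follows. This is a minimal illustration, not the authors' Proc_only code: it assumes 64-bit keys, an 8-bit radix (RADIX_BITS), and a counting pass as the stable bucket sort; the name radix_sort_u64 is hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define RADIX_BITS 8                 /* assumed radix_size */
#define BUCKETS (1u << RADIX_BITS)

/* Least-significant-digit radix sort of n 64-bit keys.
 * Returns 0 on success, -1 if the scratch buffer cannot be allocated. */
int radix_sort_u64(uint64_t *a, size_t n)
{
    if (n < 2)
        return 0;
    uint64_t *tmp = malloc(n * sizeof *tmp);
    if (tmp == NULL)
        return -1;

    /* One stable bucket pass per digit: ceil(64 / RADIX_BITS) passes. */
    for (unsigned shift = 0; shift < 64; shift += RADIX_BITS) {
        size_t count[BUCKETS] = {0};

        /* Histogram of the current digit. */
        for (size_t i = 0; i < n; ++i)
            count[(a[i] >> shift) & (BUCKETS - 1)]++;

        /* Turn counts into the starting offset of each bucket. */
        size_t pos = 0;
        for (size_t b = 0; b < BUCKETS; ++b) {
            size_t c = count[b];
            count[b] = pos;
            pos += c;
        }

        /* Scatter in input order so that equal digits keep their order. */
        for (size_t i = 0; i < n; ++i)
            tmp[count[(a[i] >> shift) & (BUCKETS - 1)]++] = a[i];

        memcpy(a, tmp, n * sizeof *a);
    }
    free(tmp);
    return 0;
}
```

For the 128-bit benchmark keys the same loop would simply run over twice as many digits.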

Priority Queue Sort • [Figure: Sorted Sublist 1 through Sorted Sublist 8 feed a tree of seven "<" comparators that repeatedly outputs the smallest remaining element]
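One way to picture the priority queue sort is as a k-way merge of sorted sublists. The sketch below uses a small binary min-heap in place of the comparator tree in the figure, which is an assumption; the slides do not show the software implementation. The names pq_merge, heap_item, and MAX_WAYS are hypothetical, and 64-bit keys are used for brevity.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_WAYS 8   /* matches the eight sublists in the figure */

typedef struct {
    uint64_t key;    /* current head of one sublist */
    size_t   src;    /* which sublist it came from */
} heap_item;

/* Restore the min-heap property below index i. */
static void sift_down(heap_item *h, size_t n, size_t i)
{
    for (;;) {
        size_t l = 2 * i + 1, r = l + 1, m = i;
        if (l < n && h[l].key < h[m].key) m = l;
        if (r < n && h[r].key < h[m].key) m = r;
        if (m == i) return;
        heap_item t = h[i]; h[i] = h[m]; h[m] = t;
        i = m;
    }
}

/* Merge k sorted sublists into out (which must hold the sum of lens).
 * Returns the number of keys written, or 0 if k exceeds MAX_WAYS. */
size_t pq_merge(const uint64_t *const *lists, const size_t *lens,
                size_t k, uint64_t *out)
{
    heap_item heap[MAX_WAYS];
    size_t next[MAX_WAYS];
    size_t hn = 0, written = 0;

    if (k > MAX_WAYS)
        return 0;

    /* Seed the heap with the head of every non-empty sublist. */
    for (size_t s = 0; s < k; ++s) {
        next[s] = 0;
        if (lens[s] > 0) {
            heap[hn].key = lists[s][0];
            heap[hn].src = s;
            hn++;
            next[s] = 1;
        }
    }
    for (size_t i = hn / 2; i-- > 0; )
        sift_down(heap, hn, i);

    /* Emit the minimum, then refill from the sublist it came from. */
    while (hn > 0) {
        size_t s = heap[0].src;
        out[written++] = heap[0].key;
        if (next[s] < lens[s])
            heap[0].key = lists[s][next[s]++];
        else
            heap[0] = heap[--hn];
        sift_down(heap, hn, 0);
    }
    return written;
}
```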

128-bit FPGA Bubble Sorting

128-bit FPGA Bubble Sorting (Cont’d) • Comparator cell (user micro) • Pipelined • Data valid bit for updating data in the comparator register • O(N^2) • Resource usage: 90% slices
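A software model can help show what a chain of comparator cells does. The sketch below is an assumption about the cell behavior, not the actual user micro: each cell keeps the smaller of its register and the incoming key, passes the larger one on, and uses a valid bit to mark whether its register holds data. The types key128 and cell and the function chain_insert are hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* A 128-bit key stored as two 64-bit halves. */
typedef struct { uint64_t hi, lo; } key128;

static bool key_less(key128 a, key128 b)
{
    return a.hi < b.hi || (a.hi == b.hi && a.lo < b.lo);
}

/* One comparator cell: a register plus its data valid bit. */
typedef struct { key128 reg; bool valid; } cell;

/* Push one key through the chain. Each cell keeps the smaller key and
 * forwards the larger one. Returns false if the chain was already full,
 * in which case the largest key falls off the end. */
bool chain_insert(cell *cells, size_t ncells, key128 in)
{
    for (size_t i = 0; i < ncells; ++i) {
        if (!cells[i].valid) {
            /* Empty register: latch the key and set the valid bit. */
            cells[i].reg = in;
            cells[i].valid = true;
            return true;
        }
        if (key_less(in, cells[i].reg)) {
            /* Keep the smaller key here, pass the larger one along. */
            key128 t = cells[i].reg;
            cells[i].reg = in;
            in = t;
        }
    }
    return false;
}
```

In this model, sorting N keys takes on the order of N^2 comparisons in total. In hardware the cells work in parallel, but the chain needs one cell per key in a block, which is consistent with the high slice usage noted above.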

128-bit FPGA Priority Queue Sorting • Map C function • Leaf node number 8 • Resource Usage: • 60% slices • 53% IOBs

Multi-threading implementation • [Figure: Input Data Array → Data Partitioning → Data sorting (Host Radix Sorting (PC), FPGA Bubble Sorting, Host Heap Sorting (PC), FPGA Heap Sorting) → Data Synchronization → 2-way Merge Sorting (PC) → Output]
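The partition, sort, and merge flow can be sketched as below. This is a simplified, single-threaded illustration: in the flow on the slide the two partitions are sorted concurrently by a host thread and the FPGA, whereas here both sorts are stand-ins that call qsort one after the other. The names partition_sort_merge and merge2 and the split parameter are hypothetical, and 64-bit keys are used for brevity.

```c
#include <stdint.h>
#include <stdlib.h>

static int cmp_u64(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* 2-way merge of two sorted runs into out (size na + nb). */
static void merge2(const uint64_t *a, size_t na,
                   const uint64_t *b, size_t nb, uint64_t *out)
{
    size_t i = 0, j = 0, k = 0;
    while (i < na && j < nb)
        out[k++] = (b[j] < a[i]) ? b[j++] : a[i++];
    while (i < na) out[k++] = a[i++];
    while (j < nb) out[k++] = b[j++];
}

/* Split data at index split, sort each part, then merge into out.
 * The first part stands in for the host's share of the work and the
 * second for the FPGA's share; both are sorted in place. */
void partition_sort_merge(uint64_t *data, size_t n, size_t split,
                          uint64_t *out)
{
    if (split > n)
        split = n;
    qsort(data, split, sizeof *data, cmp_u64);             /* host stand-in */
    qsort(data + split, n - split, sizeof *data, cmp_u64); /* FPGA stand-in */
    /* Synchronization point: both parts must be done before the merge. */
    merge2(data, split, data + split, n - split, out);
}
```

Choosing split lets the faster side take the larger share of the data.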

Experiments and Results • Data set • Randomly generated • N = 524288 (512K) • n = 128 bits • Total memory needed: 512K * 16 bytes = 8MB (fits into two on-board memory banks) • One iteration • Time measurement • “CPU time” • getrusage() • Not a valid measurement since FPGA computation time is not counted • Wall clock time • gettimeofday() • Not accurate in a multi-user environment • Results averaged over 5 runs as a rough estimate
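The two time measurements can be taken side by side, as in the sketch below. It wraps an arbitrary function call with getrusage() and gettimeofday(); the wrapper time_call, the timing struct, and the sample workload spin are hypothetical names, not code from the experiments.

```c
#include <stddef.h>
#include <sys/resource.h>
#include <sys/time.h>

typedef struct {
    double cpu_sec;   /* user + system CPU time of this process */
    double wall_sec;  /* elapsed wall clock time */
} timing;

static double tv_to_sec(struct timeval tv)
{
    return (double)tv.tv_sec + (double)tv.tv_usec / 1e6;
}

/* Hypothetical host-side workload used only to exercise the timer. */
void spin(void *arg)
{
    volatile unsigned long sink = 0;
    unsigned long n = *(unsigned long *)arg;
    for (unsigned long i = 0; i < n; ++i)
        sink += i;
}

/* Run fn(arg) and record both measurements. Returns 0 on success. */
int time_call(void (*fn)(void *), void *arg, timing *t)
{
    struct rusage r0, r1;
    struct timeval w0, w1;

    if (getrusage(RUSAGE_SELF, &r0) != 0 || gettimeofday(&w0, NULL) != 0)
        return -1;
    fn(arg);
    if (gettimeofday(&w1, NULL) != 0 || getrusage(RUSAGE_SELF, &r1) != 0)
        return -1;

    t->cpu_sec = (tv_to_sec(r1.ru_utime) + tv_to_sec(r1.ru_stime))
               - (tv_to_sec(r0.ru_utime) + tv_to_sec(r0.ru_stime));
    t->wall_sec = tv_to_sec(w1) - tv_to_sec(w0);
    return 0;
}
```

If fn spends most of its time waiting on the FPGA, cpu_sec stays small while wall_sec grows, which is why the CPU time figure undercounts FPGA work.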

Some Parameters Measured • Average map allocation time • 0.26 sec • Average map release time • 0.000041 sec • Data In time • 0.030878 sec • Data Out time • 0.051951 sec • Multi-threading overhead • 0.075 sec
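The measured parameters above add up to a fixed cost that every FPGA call has to recover. The sketch below sums them and applies a simple break-even test. Treating the overheads as additive and independent of block size is an assumption, and the names overheads, measured_overheads, fpga_fixed_overhead, and offload_pays_off are hypothetical.

```c
/* Fixed costs in seconds, copied from the measured averages above. */
typedef struct {
    double map_alloc;
    double map_release;
    double data_in;
    double data_out;
    double threading;
} overheads;

const overheads measured_overheads = {
    0.26, 0.000041, 0.030878, 0.051951, 0.075
};

/* Sum of the fixed costs; the threading overhead is added only when
 * the multi-threaded implementation is used. */
double fpga_fixed_overhead(const overheads *o, int multithreaded)
{
    double t = o->map_alloc + o->map_release + o->data_in + o->data_out;
    return multithreaded ? t + o->threading : t;
}

/* Offloading pays off only if FPGA compute time plus fixed overhead
 * is less than the software-only time. */
int offload_pays_off(double sw_sec, double hw_sec, double overhead_sec)
{
    return hw_sec + overhead_sec < sw_sec;
}
```

With these figures the fixed overhead is roughly 0.34 sec without threading, and most of it is map allocation.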

Execution Times (sec) • Times include • Data processing • Map alloc/release time • Data in/out • Thread creating/scheduling/removing • FPGA only • As much as 1.5 for small block size • Proc_only wins out for large block size due to large cache effects • Multi-threading • Generally, a good trade-off between FPGA-only and Proc-only • Not efficient when overhead becomes significant

Detailed Timing (sec) • O(N^2) effect for hardware bubble sorting • When blocksize = 65536 (8 blocks) • Hw priority queue = 0.083 sec • SW priority queue = 0.261 sec • Speedup = about 3 times

Lessons learned • SRC Platform is generally easy to program • Easy to explore different design alternatives • Performance • Overhead • Map alloc/release, data movement, etc. • Flexibility vs. performance • Hw priority queue • Knowledge of the Map C compiler • Extra cycles for each loop in hw priority queue • Measurement accuracy • Elapsed time

Work in progress • Using all hw resources • 2nd FPGA and all memory banks • Other, more hw-efficient algorithms • Hw radix sort • Optimizing performance • Parallel execution
