DARPA Sorting Benchmark On SRC Platform

Gang Quan, Allen Michalski, Duncan Buell, James Davis. Department of CSE, University of South Carolina.

Presentation Transcript


  1. DARPA Sorting Benchmark On SRC Platform • Gang Quan, Allen Michalski, Duncan Buell, James Davis • Department of CSE, University of South Carolina

  2. Outline • Introduction • DARPA Benchmark Suite • SRC Platform • Integer Sorting Algorithm Implementations • Experiments and Results • Discussions

  3. DARPA Benchmark Suite • Six benchmarks to measure the performance of high productivity computing systems • Benchmark suite • Large shared memory random access • Matrix multiplication with multiprecise modular coefficients • Dynamic programming • Data transposition • Integer sort • Bit string pattern matching

  4. DARPA Integer Sorting Benchmark • Sorting a stream of n-bit unsigned integers of length N • In-place sorting is not required • Non-unique integers • In-core sort • N = 10^6, n = 64 • N = 5*10^7, n = 128 • Secondary memory sort • N = 5*10^7, n = 64

  5. SRC Architecture • Host • 1 GB main memory • 512K L2 cache • MAP • 2 XC2V6000 • 6 memory banks (24 MB total) • 800 MB/s to/from main memory • 4800 MB/s to/from on-board memory

  6. Software Platform • System • Linux (Red Hat 7.3) • Driver and Library additions • Compilers • Intel Compilers (C/C++, Fortran), static and run-time libraries • SRC Compilers (C/C++, Fortran) and FPGA Micro • Tools • FPGA • Synplicity Synplify Pro • Xilinx Alliance ISE

  7. SRC Compilation Process

  8. Integer Sorting Implementation • Software only (Proc_only) • FPGA Implementation • Multi-threading

  9. Software Only • Radix Sort • radix_sort(A, radix_size): let |A| be the maximum number of bits in each element; for (i = 1; i <= ceil(|A| / radix_size); ++i) bucket_sort(A, on "digit" i); • Priority queue sort
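The radix sort pseudocode on this slide can be sketched in C roughly as follows. This is a minimal illustration, not the authors' code: it assumes 64-bit keys, a least-significant-digit-first pass order, and a counting-based stable bucket sort per digit. The function name `radix_sort_u64` and the parameter `radix_bits` (the slide's `radix_size`) are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* LSD radix sort of 64-bit unsigned keys, radix_bits bits per "digit".
 * Each pass is a stable bucket (counting) sort on one digit.
 * Returns 0 on success, -1 if scratch memory cannot be allocated. */
int radix_sort_u64(uint64_t *a, size_t n, unsigned radix_bits)
{
    assert(radix_bits >= 1 && radix_bits <= 16);
    if (n < 2)
        return 0;

    const unsigned key_bits = 64;                       /* |A| in the slide */
    const size_t buckets = (size_t)1 << radix_bits;
    const unsigned passes = (key_bits + radix_bits - 1) / radix_bits; /* ceil */

    uint64_t *tmp = malloc(n * sizeof *tmp);
    size_t *count = malloc((buckets + 1) * sizeof *count);
    if (!tmp || !count) {
        free(tmp);
        free(count);
        return -1;
    }

    for (unsigned p = 0; p < passes; ++p) {
        unsigned shift = p * radix_bits;

        /* Histogram: count[d + 1] = number of keys whose digit is d. */
        memset(count, 0, (buckets + 1) * sizeof *count);
        for (size_t i = 0; i < n; ++i)
            ++count[((a[i] >> shift) & (buckets - 1)) + 1];

        /* Prefix sum: count[d] becomes the first output slot for digit d. */
        for (size_t b = 0; b < buckets; ++b)
            count[b + 1] += count[b];

        /* Stable scatter into the scratch buffer, then copy back. */
        for (size_t i = 0; i < n; ++i)
            tmp[count[(a[i] >> shift) & (buckets - 1)]++] = a[i];
        memcpy(a, tmp, n * sizeof *a);
    }

    free(tmp);
    free(count);
    return 0;
}
```

With `radix_bits = 8` the loop runs ceil(64 / 8) = 8 passes, matching the loop bound in the pseudocode.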

  10. Priority Queue Sort • [Figure: Sorted Sublists 1 to 8 feeding a set of "<" comparator nodes]
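One way to read the priority queue sort is as a k-way merge: the smallest head element among the sorted sublists is repeatedly moved to the output. The sketch below uses a small binary min-heap of sublist indices as the priority queue. It is an assumption about the general technique, not the authors' implementation, and the names `sublist_t`, `pq_merge`, and `MAX_LISTS` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_LISTS 8  /* assumption: eight sublists, as in the figure */

typedef struct {
    const uint64_t *data;  /* sorted keys */
    size_t len;            /* number of keys */
    size_t pos;            /* next unread key */
} sublist_t;

/* Current head key of sublist number idx. */
static uint64_t head(const sublist_t *lists, int idx)
{
    return lists[idx].data[lists[idx].pos];
}

/* Restore the min-heap property from index i downwards.
 * heap[] holds sublist indices ordered by their head keys. */
static void sift_down(int *heap, int size, int i, const sublist_t *lists)
{
    for (;;) {
        int l = 2 * i + 1, r = l + 1, m = i;
        if (l < size && head(lists, heap[l]) < head(lists, heap[m])) m = l;
        if (r < size && head(lists, heap[r]) < head(lists, heap[m])) m = r;
        if (m == i) return;
        int t = heap[i]; heap[i] = heap[m]; heap[m] = t;
        i = m;
    }
}

/* Merge k sorted sublists into out[]; returns the number of keys written.
 * out must have room for the total length of all sublists. */
size_t pq_merge(sublist_t *lists, int k, uint64_t *out)
{
    assert(k >= 0 && k <= MAX_LISTS);
    int heap[MAX_LISTS];
    int size = 0;
    size_t n = 0;

    for (int i = 0; i < k; ++i)            /* skip empty sublists */
        if (lists[i].pos < lists[i].len) heap[size++] = i;
    for (int i = size / 2 - 1; i >= 0; --i)
        sift_down(heap, size, i, lists);

    while (size > 0) {
        sublist_t *s = &lists[heap[0]];
        out[n++] = s->data[s->pos++];      /* emit the smallest head */
        if (s->pos == s->len)              /* sublist exhausted: drop it */
            heap[0] = heap[--size];
        sift_down(heap, size, 0, lists);
    }
    return n;
}
```

Each output key costs O(log k) comparisons, so merging N keys from 8 sublists takes on the order of 3N comparisons.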

  11. 128-bit FPGA Bubble Sorting

  12. 128-bit FPGA Bubble Sorting (Cont’d) • Comparator cell (user micro) • Pipelined • Data valid bit for updating data in the comparator register • O(N^2) • Resource usage • 90% slices
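The slide does not spell out the comparator cell logic, so the following is only a software model under stated assumptions: each cell holds one 128-bit key plus a valid bit, keeps the smaller of its stored key and the incoming key, and passes the larger one to the next cell. The names `key128_t`, `cell_t`, and `chain_push` are hypothetical. In hardware the cells would operate concurrently as a pipeline; the loop here simply visits them in order.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* A 128-bit unsigned key stored as two 64-bit halves. */
typedef struct {
    uint64_t hi;
    uint64_t lo;
} key128_t;

/* One comparator cell: a key register and its data-valid bit. */
typedef struct {
    key128_t reg;
    bool valid;
} cell_t;

/* Unsigned 128-bit "less than". */
static bool key_less(key128_t a, key128_t b)
{
    return a.hi < b.hi || (a.hi == b.hi && a.lo < b.lo);
}

/* Stream one key through the chain of cells.
 * An empty cell latches the key and sets its valid bit.
 * A full cell keeps the smaller key and forwards the larger one.
 * After pushing m <= ncells keys, cells[0..m) hold them in ascending order. */
void chain_push(cell_t *cells, size_t ncells, key128_t in)
{
    for (size_t i = 0; i < ncells; ++i) {
        if (!cells[i].valid) {
            cells[i].reg = in;
            cells[i].valid = true;
            return;
        }
        if (key_less(in, cells[i].reg)) {
            key128_t larger = cells[i].reg;
            cells[i].reg = in;
            in = larger;
        }
    }
    /* Chain full: the largest key is dropped, so the caller must
     * provide at least as many cells as keys. */
}
```

In this model, sorting a block of N keys performs up to N compare steps per key, which is one way to arrive at the O(N^2) figure quoted on the slide.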

  13. 128-bit FPGA Priority Queue Sorting • Map C function • Number of leaf nodes: 8 • Resource Usage: • 60% slices • 53% IOBs

  14. Multi-threading implementation • Input Data Array • Data Partitioning • Data sorting: Host Radix Sorting (PC), FPGA Bubble Sorting, Host Heap Sorting (PC), FPGA Heap Sorting • Data Synchronization • 2-way Merge Sorting (PC) • Output
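The partition, sort, and merge flow on this slide can be illustrated with a sequential C sketch. It makes several simplifying assumptions: keys are 64-bit, the per-block sort is the standard library `qsort` standing in for whichever host or FPGA sorter handles the block, and threads are left out, so the blocks are sorted one after another rather than concurrently. The names `block_sort_merge` and `merge2` are hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

static int cmp_u64(const void *x, const void *y)
{
    uint64_t a = *(const uint64_t *)x, b = *(const uint64_t *)y;
    return (a > b) - (a < b);
}

/* Merge sorted runs a[0..na) and b[0..nb) into out[0..na+nb).
 * Ties are taken from a first, so the merge is stable. */
static void merge2(const uint64_t *a, size_t na,
                   const uint64_t *b, size_t nb, uint64_t *out)
{
    size_t i = 0, j = 0, k = 0;
    while (i < na && j < nb)
        out[k++] = (b[j] < a[i]) ? b[j++] : a[i++];
    while (i < na) out[k++] = a[i++];
    while (j < nb) out[k++] = b[j++];
}

/* Partition a[0..n) into blocks of block_len keys, sort each block,
 * then merge neighbouring runs pairwise until one sorted run remains.
 * Returns 0 on success, -1 on bad arguments or allocation failure. */
int block_sort_merge(uint64_t *a, size_t n, size_t block_len)
{
    if (n < 2) return 0;
    if (block_len == 0) return -1;

    uint64_t *tmp = malloc(n * sizeof *tmp);
    if (!tmp) return -1;

    /* Stage 1: data partitioning and per-block sorting.
     * In the slide's design each block would go to a host or FPGA worker. */
    for (size_t s = 0; s < n; s += block_len) {
        size_t len = (n - s < block_len) ? n - s : block_len;
        qsort(a + s, len, sizeof *a, cmp_u64);
    }

    /* Stage 2: bottom-up 2-way merge; run width doubles every round. */
    for (size_t w = block_len; w < n; w *= 2) {
        for (size_t s = 0; s < n; s += 2 * w) {
            size_t mid = (n - s < w) ? n : s + w;
            size_t end = (n - s < 2 * w) ? n : s + 2 * w;
            merge2(a + s, mid - s, a + mid, end - mid, tmp + s);
        }
        memcpy(a, tmp, n * sizeof *a);
    }

    free(tmp);
    return 0;
}
```

With 8 blocks, as in the blocksize = 65536 case reported later, the merge stage needs three rounds.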

  15. Experiments and Results • Data set • Randomly generated • N = 524288 (512K) • n = 128 bits • Total memory needed: 512K * 16 bytes = 8 MB (fits into two on-board memory banks) • One iteration • Time measurement • “CPU time” • getrusage() • Not a valid measurement since FPGA computation time is not counted • Wall clock time • gettimeofday() • Not accurate in a multi-user environment • Results averaged over 5 runs as a rough estimate
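The two time measurements named on this slide can be taken around the same call, as in the sketch below. `getrusage()` and `gettimeofday()` are the POSIX calls the slide mentions; the wrapper `time_call`, the struct `timing_t`, and the sample workload `busy_work` are hypothetical names introduced only for illustration.

```c
#include <stddef.h>
#include <sys/resource.h>
#include <sys/time.h>

typedef struct {
    double wall_sec;  /* elapsed wall clock time */
    double cpu_sec;   /* user + system CPU time of this process */
} timing_t;

/* Difference b - a between two timevals, in seconds. */
static double tv_diff(struct timeval a, struct timeval b)
{
    return (double)(b.tv_sec - a.tv_sec) + (double)(b.tv_usec - a.tv_usec) / 1e6;
}

/* Run fn(arg) once and report both wall clock and CPU time. */
timing_t time_call(void (*fn)(void *), void *arg)
{
    struct timeval w0, w1;
    struct rusage r0, r1;
    timing_t t;

    gettimeofday(&w0, NULL);
    getrusage(RUSAGE_SELF, &r0);
    fn(arg);
    getrusage(RUSAGE_SELF, &r1);
    gettimeofday(&w1, NULL);

    t.wall_sec = tv_diff(w0, w1);
    t.cpu_sec = tv_diff(r0.ru_utime, r1.ru_utime)
              + tv_diff(r0.ru_stime, r1.ru_stime);
    return t;
}

/* Sample workload: a simple loop whose length is passed through arg. */
void busy_work(void *arg)
{
    unsigned long n = *(unsigned long *)arg;
    volatile unsigned long sink = 0;
    for (unsigned long i = 0; i < n; ++i)
        sink += i;
}
```

The CPU figure covers only time the host process spends on the processor, which is why the slide calls it invalid when part of the work runs on the FPGA. The wall clock figure captures that time but also includes delays caused by other users on the machine.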

  16. Some Parameters Measured • Average map allocation time • 0.26 sec • Average map release time • 0.000041 sec • Data In time • 0.030878 sec • Data Out time • 0.051951 sec • Multi-threading overhead • 0.075 sec

  17. Execution Times (sec) • Include time for • Data processing • Map alloc/release time • Data in/out • Thread creating/scheduling/removing • FPGA Only • As much as 1.5 for small block size • Proc_only wins out for large block size due to large cache effects • Multi-threading • Generally, a good trade-off between FPGA-only and Proc-only • Not efficient when overhead becomes significant

  18. Detailed Timing (sec) • O(N^2) effect for hardware bubble sorting • When blocksize = 65536 (8 blocks) • Hw priority queue = 0.083 sec • SW priority queue = 0.261 sec • Speedup = 3 times
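The quoted speedup follows directly from the two priority queue timings on this slide, taking speedup as software time divided by hardware time:

```latex
\[
  \text{speedup}
  = \frac{t_{\mathrm{SW}}}{t_{\mathrm{HW}}}
  = \frac{0.261\ \text{s}}{0.083\ \text{s}}
  \approx 3.14
\]
```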

  19. Lessons learned • SRC Platform is generally easy to program • Easy to explore different design alternatives • Performance • Overhead • Map alloc/release, data movement, etc. • Flexibility vs. performance • Hw priority queue • Knowledge of the Map C compiler • Extra cycles for each loop in hw priority queue • Measurement accuracy • Elapsed time

  20. Work in progress • Using all hw resources • 2nd FPGA and all memory banks • Other, more hw-efficient algorithms • Hw radix sort • Optimizing performance • Parallel execution
