Performance Comparison of CORDIC Implementations on the SRC-6E Reconfigurable Computer

Russ Duren, Baylor University, Waco, Texas. Douglas Fouts and Dan Zulaica, Naval Postgraduate School, Monterey, California.

Presentation Transcript


  1. Performance Comparison of CORDIC Implementations on the SRC-6E Reconfigurable Computer
     Russ Duren, Baylor University, Waco, Texas
     Douglas Fouts and Dan Zulaica, Naval Postgraduate School, Monterey, California

  2. SRC-6E RECONFIGURABLE COMPUTER
     • HARDWARE ARCHITECTURE CONSISTS OF 2 IDENTICAL HALVES
     • EACH SIDE HAS
       • 2 INTEL PENTIUM® PROCESSORS
       • 2 VIRTEX II SIX MILLION GATE FPGAS
       • MEMORY-TO-MEMORY CONNECTION
     • PROGRAMS CAN BE WRITTEN IN C OR FORTRAN
     • PORTIONS OF THE PROGRAM ARE CONVERTED TO CIRCUITRY IN THE FPGAS
       • GREATLY INCREASES EXECUTION SPEED

  3. SRC-6E HARDWARE ARCHITECTURE (1/2)
     [Block diagram. Labels: μP Board (two Intel® μP, each with L2; MIOC; Common Memory; PCI; SNAP) and MAP (Controller; On-Board Memory (24 MB); two FPGAs). Link rates shown: 315/195 MB/s (peak), 6x 800 MB/s, Chain Port 800 MB/s.]

  4. SRC-6E PROGRAMMING ENVIRONMENT
     • PENTIUM PROCESSORS
       • LINUX OPERATING SYSTEM IS THE MAIN USER INTERFACE
       • C AND FORTRAN COMPILERS FOR CODE DEVELOPMENT
       • INTEL AND GNU COMPILERS SUPPORTED
     • FPGAS
       • SRC-6E CUSTOM COMPILER CONVERTS HLL TO FPGA CIRCUITRY
       • FPGA ROUTINES WRITTEN IN C OR FORTRAN
     • SOFTWARE GENERATES ONE EXECUTABLE RESULT

  5. HARDWARE COMPILER
     • SRC COMPILER TRANSLATES THE C SOURCE INTO FPGA CIRCUITRY
     • CIRCUITRY IS DEEPLY PIPELINED
     • FPGAS ARE CLOCKED AT 100 MHz
     • ONCE THE PIPELINE IS FILLED (LATENCY), RESULTS ARE PRODUCED EVERY 10 NANOSECONDS

  6. LOWER LEVEL “PROGRAMMING”
     • SUPPORTS DEVELOPMENT OF FPGA “ROUTINES” USING TRADITIONAL TOOLS:
       • VERILOG AND VHDL
       • SCHEMATIC CAPTURE
       • INTELLECTUAL PROPERTY (IP) CORES
     • THESE METHODS ARE ANALOGOUS TO PROGRAMMING IN ASSEMBLY LANGUAGE FOR A NORMAL PROCESSOR
     • IP CORES ARE ANALOGOUS TO A LIBRARY OF ASSEMBLY LANGUAGE ROUTINES

  7. EVALUATING THE SRC C HARDWARE COMPILER
     • INTENDED TO ENABLE C PROGRAMMERS TO TAKE ADVANTAGE OF THE SRC-6E WITHOUT REQUIRING A KNOWLEDGE OF DIGITAL CIRCUIT DESIGN
     • HOW DO YOU EVALUATE THE COMPILER?
     • EASE OF USE – OBJECTIVE MEASURES?
       • SCOPE OF MODIFICATIONS REQUIRED TO USE THE FPGAS
       • LIMITATIONS IMPOSED ON HLL ROUTINES
     • TYPICAL MEASURES OF COMPILER PERFORMANCE
       • OBJECT CODE SIZE - TRANSLATES TO FPGA CIRCUIT AREA
       • CODE EXECUTION SPEED

  8. MODIFICATIONS REQUIRED TO USE THE FPGAS
     • CODE FOR FPGAS IS ISOLATED INTO AN EXTERNAL FUNCTION
     • CALLS ARE INSERTED IN THE MAIN PROGRAM TO
       • INITIALIZE THE SRC
       • CALL THE EXTERNAL FUNCTION
       • RELEASE THE SRC (OPTIONAL)
     • EXTERNAL FUNCTION INCLUDES CALLS TO
       • TRANSFER INPUT DATA FROM SYSTEM COMMON MEMORY TO ON-BOARD MEMORY
       • TRANSFER OUTPUT DATA FROM ON-BOARD MEMORY TO SYSTEM COMMON MEMORY
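The call structure described on this slide can be sketched in plain C. This is only an illustration of the shape of the code, not the SRC programming interface: `hypothetical_src_init`, `hypothetical_src_release`, `external_function`, `run_on_src`, and the `OBM_WORDS` buffer size are all invented names, the memory transfers are modeled with `memcpy`, and the "computation" is a placeholder increment. The multiple-of-4 check reflects the transfer restriction mentioned elsewhere in the presentation.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define OBM_WORDS 1024            /* illustrative on-board buffer size (assumption) */

static int src_ready = 0;         /* stand-in for "the SRC has been initialized" */
static int32_t obm_in[OBM_WORDS]; /* stand-in for on-board memory (input bank) */
static int32_t obm_out[OBM_WORDS];/* stand-in for on-board memory (output bank) */

/* Hypothetical stubs: NOT the real SRC calls, just placeholders. */
void hypothetical_src_init(void)    { src_ready = 1; }
void hypothetical_src_release(void) { src_ready = 0; }

/* The "external function": the part that would be compiled to FPGA circuitry.
 * Returns 0 on success, -1 if the request cannot be handled. */
int external_function(const int32_t *cm_in, int32_t *cm_out, size_t n)
{
    size_t i;

    if (!src_ready || n > OBM_WORDS || n % 4 != 0)
        return -1;                 /* word count must be a multiple of 4 */

    /* Transfer input: system common memory -> on-board memory. */
    memcpy(obm_in, cm_in, n * sizeof obm_in[0]);

    /* Placeholder computation standing in for the pipelined loop. */
    for (i = 0; i < n; i++)
        obm_out[i] = obm_in[i] + 1;

    /* Transfer output: on-board memory -> system common memory. */
    memcpy(cm_out, obm_out, n * sizeof obm_out[0]);
    return 0;
}

/* The main-program side: initialize, call the external function, release. */
int run_on_src(const int32_t *cm_in, int32_t *cm_out, size_t n)
{
    int rc;

    hypothetical_src_init();
    rc = external_function(cm_in, cm_out, n);
    hypothetical_src_release();    /* optional according to the slide */
    return rc;
}
```

Calling `run_on_src` with a 4-word buffer succeeds, while a 3-word buffer is rejected by the multiple-of-4 check.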

  9. LIMITATIONS IMPOSED ON HLL ROUTINES
     • VERSION 1.3 SUPPORTS:
       • DATA TYPES: 32-BIT (INT) AND 64-BIT (LONG LONG)
       • ADD, SUBTRACT, MULTIPLY
       • DIVISION (32-BIT ONLY)
       • RELATIONAL OPERATORS (==, !=, <, >, <=, >=)
       • BITWISE OPERATORS AND SHIFTS (&, |, !, <<, >>)
       • LOGICAL OPERATORS (&&, !, ||)
       • SQRT() (32-BIT ONLY)
     • SOME RESTRICTIONS ON VARIABLE ACCESS AND CONTROL STATEMENTS
       • NOT PARTICULARLY LIMITING, BUT MUST BE CONSIDERED
       • EXAMPLE: NUMBER OF DATA WORDS TRANSFERRED MUST BE MULTIPLE OF 4
     • FUTURE VERSIONS OF THE COMPILER WILL SUPPORT MORE OPERATIONS AND DATA TYPES AND RELAX RESTRICTIONS

  10. TESTING HARDWARE COMPILER PERFORMANCE
      • WHAT BENCHMARK DO YOU USE FOR CIRCUIT AREA AND EXECUTION SPEED?
        • TYPICAL COMPILER EVALUATIONS COMPARE COMPILED CODE TO HAND-CODED ASSEMBLY LANGUAGE ROUTINES
        • COMPARED COMPILER OUTPUT TO RESULTS OBTAINED USING INTELLECTUAL PROPERTY CORES
      • SELECTED CORDIC ALGORITHM AS BASIS OF COMPARISON
        • USES SHIFTS AND ADDS TO CALCULATE ARCTANGENT OF TWO NUMBERS
        • WELL SUITED TO IMPLEMENTATION ON AN FPGA
        • IP CORE AVAILABLE AT NO COST
      • USED XILINX CORE GENERATOR PROGRAM TO DEVELOP CORES

  11. CORE GENERATOR RESULTS

  12. TESTING COMPILER EFFICIENCY
      • CORES VARIED IN INTERNAL PRECISION AND LATENCY
      • ALTHOUGH INPUTS TO CORES WERE 8 BITS WIDE, THE INTERFACE TO THE CALLING ROUTINE USED 32-BIT INTEGERS
      • DEVELOPED C ROUTINE TO CALCULATE CORDIC
        • MOST CLOSELY MATCHED CORE NUMBER 2
        • 32-BIT INTEGER VARIABLES
        • 11 ITERATIONS
      • WROTE COMMON “MAIN” PROGRAM
        • CALLED CORDIC ROUTINE 256 TIMES
        • EACH CALL CALCULATED 249984 ARCTANGENTS
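The presentation does not reproduce the C routine itself, so the following is only a sketch of what a shift-and-add CORDIC arctangent with 32-bit integers and 11 iterations might look like. Everything specific here is an assumption: the function name `cordic_atan_q16`, the Q16 fixed-point radian output, the pre-scaling of the inputs by 65536, and the restriction to x > 0 with inputs of 8-bit magnitude. It also assumes that `>>` on a negative `int32_t` is an arithmetic shift, which holds on common compilers but is implementation-defined in C.

```c
#include <stdint.h>

#define CORDIC_ITERATIONS 11

/* atan(2^-i) in radians, scaled by 65536 (Q16), for i = 0..10. */
static const int32_t atan_table_q16[CORDIC_ITERATIONS] = {
    51472, 30386, 16055, 8150, 4091, 2047, 1024, 512, 256, 128, 64
};

/* Vectoring-mode CORDIC: returns atan(y / x) in Q16 radians.
 * Assumptions (not from the slides): x > 0, |x| and |y| <= 127,
 * and arithmetic right shift of negative values. */
int32_t cordic_atan_q16(int32_t x, int32_t y)
{
    int32_t angle = 0;
    int32_t x_next;
    int i;

    /* Pre-scale so the right shifts below keep fractional precision.
     * Multiplication is used instead of << to avoid shifting negatives left. */
    x *= 65536;
    y *= 65536;

    for (i = 0; i < CORDIC_ITERATIONS; i++) {
        if (y > 0) {
            /* Rotate clockwise by atan(2^-i), driving y toward zero. */
            x_next = x + (y >> i);
            y      = y - (x >> i);
            angle += atan_table_q16[i];
        } else {
            /* Rotate counter-clockwise by atan(2^-i). */
            x_next = x - (y >> i);
            y      = y + (x >> i);
            angle -= atan_table_q16[i];
        }
        x = x_next;
    }
    return angle;
}
```

Each iteration uses only shifts, adds, and a table lookup, which is why the loop maps naturally onto a pipeline. With 11 iterations the residual angle error is on the order of atan(2^-10), roughly 0.001 radians, so `cordic_atan_q16(1, 1)` lands close to 51472 (π/4 in Q16) rather than exactly on it.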

  13. RESULTS
      *33,792 SLICES AVAILABLE

  14. CIRCUIT AREA COMPARISON
      • C CODE REQUIRED 2 – 5 TIMES MORE CIRCUIT AREA
      • WHERE DOES THE FACTOR OF 2 COME FROM?
        • TOTAL CIRCUIT AREA
        • 6710 SLICES COMPARED TO 3555 FOR MOST SIMILAR CORE
      • WHERE DOES THE FACTOR OF 5 COME FROM?
        • AREA OF CORDIC ROUTINE ITSELF
        • APPROXIMATELY 2800 SLICES ADDED TO IP CORE FOR INTERFACE
        • SUBTRACTING 2800 FROM 6710 SLICES SAYS THE C CORDIC REQUIRED ABOUT 3900 SLICES
        • 3900 COMPARED TO 777 IS A RATIO OF 5 TO 1
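The two ratios on this slide follow from simple arithmetic on the quoted slice counts. The helper names below (`routine_only_slices`, `area_ratio`) are illustrative and not from the presentation; the numbers are the ones given on the slide.

```c
/* Slices attributable to the routine itself once the interface
 * overhead is subtracted from the total. */
int routine_only_slices(int total_slices, int interface_slices)
{
    return total_slices - interface_slices;
}

/* Ratio of the C implementation's area to the IP core's area. */
double area_ratio(int c_slices, int core_slices)
{
    return (double)c_slices / (double)core_slices;
}
```

With the slide's figures, `area_ratio(6710, 3555)` is about 1.89 (the "factor of 2" for total area), and `area_ratio(routine_only_slices(6710, 2800), 777)` is about 5.03 (the "factor of 5" for the CORDIC routine alone).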

  15. CIRCUIT AREA IMPLICATIONS
      • CIRCUIT AREA EFFICIENCY LIMITS PROGRAM SIZE
      • EXTRA CIRCUIT AREA CAN BE USED FOR LOOP UNROLLING AND PARALLELIZATION TO INCREASE EXECUTION SPEED
        • MAY REQUIRE USE OF IP CORES OR CUSTOM CIRCUIT DESIGN
        • EXAMPLE: USED 8 COPIES OF CORE 1 IN PARALLEL
          • RESULTING CIRCUIT REQUIRED ONLY 6814 SLICES
          • APPROXIMATELY SAME AS ONE CORDIC WRITTEN IN C
      • IF THE PROGRAM FITS, THEN THIS DOESN’T MATTER
      • FPGA DENSITY IS EXCEEDING MOORE’S LAW

  16. EXECUTION SPEED COMPARISON
      • TOTAL EXECUTION TIME IS APPROXIMATELY IDENTICAL AT 7.4 SEC
      • LATENCY FOR C CODE IS ABOUT TWICE THAT OF IP CORES
      • LATENCY IS INSIGNIFICANT WITH LARGE DATA BLOCKS
        Time per loop = (Latency – 1 + Number of Calculations) * 10 ns
        (112 – 1 + 249984) * 10 ns = 2.5 msec/loop
      • CALCULATION TIME IS SMALL COMPARED TO EXECUTION TIME OF 7.4 SEC
        256 loops * 2.5 msec = 0.64 seconds
      • MEMORY TRANSFER TIME DOMINATES EXECUTION TIME
        • WITH PEAK TRANSFER RATE MEMORY TRANSFERS REQUIRE 4 SECONDS
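The timing formula on this slide can be checked with a few lines of C. The function names are illustrative; the 10 ns period corresponds to the 100 MHz FPGA clock stated earlier in the presentation, and the inputs in the note below are the slide's own values (latency of 112 cycles, 249984 calculations per loop, 256 loops).

```c
#include <stdint.h>

#define CLOCK_PERIOD_NS 10u   /* one result per 10 ns at 100 MHz */

/* Time for one fully pipelined loop, in nanoseconds:
 * (latency - 1 + number of calculations) clock periods. */
uint64_t loop_time_ns(uint64_t latency_cycles, uint64_t num_calculations)
{
    return (latency_cycles - 1 + num_calculations) * CLOCK_PERIOD_NS;
}

/* Total calculation time over all loops, in nanoseconds. */
uint64_t total_calc_time_ns(uint64_t latency_cycles,
                            uint64_t num_calculations,
                            uint64_t num_loops)
{
    return num_loops * loop_time_ns(latency_cycles, num_calculations);
}
```

`loop_time_ns(112, 249984)` gives 2,500,950 ns, about 2.5 msec per loop, and `total_calc_time_ns(112, 249984, 256)` gives 640,243,200 ns, about 0.64 seconds. The 111 cycles of latency contribute less than 0.05% of each loop, which is why latency is insignificant for blocks of this size.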

  17. COMPUTATIONAL THROUGHPUT SUMMARY
      • COMPUTATIONAL THROUGHPUT FOR C ROUTINES EQUALS IP CORES FOR MANY PROBLEMS
        • WHEN HLL ROUTINE CAN BE FULLY PIPELINED AND
          • NUMBER OF CALCULATIONS >> LATENCY OR
          • I/O TIME >> CALCULATION TIME
      • OFTEN NO SIGNIFICANT TIMING IMPACT FOR USE OF C

  18. ADVANTAGES OF HIGH LEVEL LANGUAGES
      • DOES NOT REQUIRE AN ADDITIONAL SKILL SET
        • NO HARDWARE KNOWLEDGE REQUIRED
      • INCREASED DESIGN FLEXIBILITY
        • PROGRAMMER IMPLEMENTS ALGORITHM AS DESIRED
        • LIMITED TO DATA WIDTHS RECOGNIZED BY THE COMPILER
      • LOWEST COST METHOD FOR APPLICATIONS WHERE IP CORES DO NOT EXIST
      • AVOIDS PURCHASE OR ROYALTY COST OF IP CORE WHEN ONE IS AVAILABLE

  19. ADVANTAGES OF IP CORES
      • EASE OF DEVELOPMENT
        • VERY LOW DEVELOPMENT EFFORT
        • PREVIOUSLY VALIDATED (HOPEFULLY)
        • EASILY RECONFIGURED
      • LOWER LATENCY
        • OFTEN ONLY A MINOR ADVANTAGE
        • IN SOME CASES IT MAY BE CRITICAL
      • MORE COMPACT CIRCUIT REALIZATION
        • 2 TO 5 TIMES LESS FPGA CIRCUITRY NEEDED
        • ALLOWS SOLUTION OF MORE COMPLEX PROBLEMS
        • ENABLES INCREASED PARALLELISM OR LOOP UNROLLING

  20. CONCLUSIONS
      • SRC-6E COMPILER ALLOWS C AND FORTRAN PROGRAMMERS TO ACCELERATE PROGRAMS WITHOUT BEING CIRCUIT DESIGNERS
      • HARDWARE COMPILER PRODUCES CODE THAT COMPARES WELL TO CUSTOM IP CORES
        • TYPICALLY SAME EXECUTION SPEED
        • REQUIRES MORE FPGA CIRCUIT AREA
      • PROVIDES CAPABILITY TO INTEGRATE IP CORES OR CUSTOM CIRCUIT DESIGNS
