performance comparison of cordic implementations on the src 6e reconfigurable computer l.
Download
Skip this Video
Loading SlideShow in 5 Seconds..
Performance Comparison of CORDIC Implementations on the SRC-6E Reconfigurable Computer PowerPoint Presentation
Download Presentation
Performance Comparison of CORDIC Implementations on the SRC-6E Reconfigurable Computer

Loading in 2 Seconds...

play fullscreen
1 / 20

Performance Comparison of CORDIC Implementations on the SRC-6E Reconfigurable Computer - PowerPoint PPT Presentation


  • 180 Views
  • Uploaded on

Performance Comparison of CORDIC Implementations on the SRC-6E Reconfigurable Computer. Russ Duren Baylor University, Waco, Texas Douglas Fouts and Dan Zulaica Naval Postgraduate School, Monterey, California . SRC-6E RECONFIGURABLE COMPUTER.

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

Performance Comparison of CORDIC Implementations on the SRC-6E Reconfigurable Computer


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
performance comparison of cordic implementations on the src 6e reconfigurable computer

Performance Comparison of CORDIC Implementations on the SRC-6E Reconfigurable Computer

Russ Duren

Baylor University, Waco, Texas

Douglas Fouts and Dan Zulaica

Naval Postgraduate School, Monterey, California

1

src 6e reconfigurable computer
SRC-6E RECONFIGURABLE COMPUTER
  • HARDWARE ARCHITECTURE CONSISTS OF 2 IDENTICAL HALVES
  • EACH SIDE HAS
    • 2 INTEL PENTIUM® PROCESSORS
    • 2 VIRTEX II SIX MILLION GATE FPGAS
    • MEMORY-TO-MEMORY CONNECTION
  • PROGRAMS CAN BE WRITTEN IN C OR FORTRAN
  • PORTIONS OF THE PROGRAM ARE CONVERTED TO CIRCUITRY IN THE FPGAS
    • GREATLY INCREASES EXECUTION SPEED

2

src 6e hardware architecture 1 2
SRC-6E HARDWARE ARCHITECTURE (1/2)

μP Board

MAP

Intel® μP

Intel® μP

315/195 MB/s (peak)

Controller

L2

L2

6x 800 MB/s

MIOC

On-Board Memory (24 MB)

6x 800 MB/s

SNAP

PCI

Common

Memory

Chain Port 800 MB/s

Chain Port 800 MB/s

FPGA

FPGA

3

src 6e programming environment
SRC-6E PROGRAMMING ENVIRONMENT
  • PENTIUM PROCESSORS
    • LINUX OPERATING SYSTEM IS THE MAIN USER INTERFACE
    • C AND FORTRAN COMPILERS FOR CODE DEVELOPMENT
    • INTEL AND GNU COMPILERS SUPPORTED
  • FPGAS
    • SRC-6E CUSTOM COMPILER CONVERTS HLL TO FPGA CIRCUITRY
    • FPGA ROUTINES WRITTEN IN C OR FORTRAN
  • SOFTWARE GENERATES ONE EXECUTABLE RESULT

4

hardware compiler
HARDWARE COMPILER
  • SRC COMPILER TRANSLATES THE C SOURCE INTO FPGA CIRCUITRY
  • CIRCUITRY IS DEEPLY PIPELINED
  • FPGAS ARE CLOCKED AT 100 MHz
  • ONCE THE PIPELINE IS FILLED (LATENCY) RESULTS ARE PRODUCED EVERY 10 NANOSECONDS

5

lower level programming
LOWER LEVEL “PROGRAMMING”
  • SUPPORTS DEVELOPMENT OF FPGA “ROUTINES” USING TRADITIONAL TOOLS:
    • VERILOG AND VHDL
    • SCHEMATIC CAPTURE
    • INTELLECTUAL PROPERTY (IP) CORES
  • THESE METHODS ARE ANALOGOUS TO PROGRAMMING IN ASSEMBLY LANGUAGE FOR A NORMAL PROCESSOR
  • IP CORES ARE ANALOGOUS TO A LIBRARY OF ASSEMBLY LANGUAGE ROUTINES

6

evaluating the src c hardware compiler
EVALUATING THE SRC C HARDWARE COMPILER
  • INTENDED TO ENABLE C PROGRAMMERS TO TAKE ADVANTAGE OF THE SRC-6E WITHOUT REQUIRING A KNOWLEDGE OF DIGITAL CIRCUIT DESIGN
  • HOW DO YOU EVALUATE THE COMPILER?
  • EASE OF USE – OBJECTIVE MEASURES?
    • SCOPE OF MODIFICATIONS REQUIRED TO USE THE FPGAS
    • LIMITATIONS IMPOSED ON HLL ROUTINES
  • TYPICAL MEASURES OF COMPILER PERFORMANCE
    • OBJECT CODE SIZE - TRANSLATES TO FPGA CIRCUIT AREA
    • CODE EXECUTION SPEED

7

modifications required to use the fpgas
MODIFICATIONS REQUIRED TO USE THE FPGAS
  • CODE FOR FPGAS IS ISOLATED IN TO AN EXTERNAL FUNCTION
  • CALLS ARE INSERTED IN THE MAIN PROGRAM TO
    • INITIALIZE THE SRC
    • CALL THE EXTERNAL FUNCTION
    • RELEASE THE SRC (OPTIONAL)
  • EXTERNAL FUNCTION INCLUDES CALLS TO
    • TRANSFER INPUT DATA FROM SYSTEM COMMON MEMORY TO ON-BOARD MEMORY
    • TRANSFER OUTPUT DATA FROM ON-BOARD MEMORY TO SYSTEM COMMON MEMORY

8

limitations imposed on hll routines
LIMITATIONS IMPOSED ON HLL ROUTINES
  • VERSION 1.3 SUPPORTS:
    • DATA TYPES: 32-BIT (INT) AND 64-BIT (LONG LONG)
    • ADD, SUBTRACT, MULTIPLY
    • DIVISION (32-BIT ONLY)
    • RELATIONAL OPERATORS (==, !=, <, >, <=, >=)
    • BITWISE OPERATORS AND SHIFTS (&, |, !, <<, >>)
    • LOGICAL OPERATORS (&&, !, ||)
    • SQRT() (32BIT ONLY)
  • SOME RESTRICTIONS ON VARIABLE ACCESS AND CONTROL STATEMENTS
    • NOT PARTICULARLY LIMITING, BUT MUST BE CONSIDERED
    • EXAMPLE: NUMBER OF DATA WORDS TRANSFERRED MUST BE MULTIPLE OF 4
  • FUTURE VERSIONS OF THE COMPILER WILL SUPPORT MORE OPERATIONS AND DATA TYPES AND RELAX RESTRICTIONS

9

testing hardware compiler performance
TESTING HARDWARE COMPILER PERFORMANCE
  • WHAT BENCHMARK DO YOU USE FOR CIRCUIT AREA AND EXECUTION SPEED?
  • TYPICAL COMPILER EVALUATIONS COMPARE COMPILED CODE TO HAND-CODED ASSEMBLY LANGUAGE ROUTINES.
  • COMPARED COMPILER OUTPUT TO RESULTS OBTAINED USING INTELLECTUAL PROPERTY CORES
  • SELECTED CORDIC ALGORITHM AS BASIS OF COMPARISON
    • USES SHIFTS AND ADDS TO CALCULATE ARCTANGENT OF TWO NUMBERS
    • WELL SUITED TO IMPLEMENTATION ON AN FPGA
    • IP CORE AVAILABLE AT NO COST
  • USED XILINX CORE GENERATOR PROGRAM TO DEVELOP CORES

10

testing compiler efficiency
TESTING COMPILER EFFICIENCY
  • CORES VARIED IN INTERNAL PRECISION AND LATENCY
  • ALTHOUGH INPUTS TO CORES WERE 8 BITS WIDE, THE INTERFACE TO THE CALLING ROUTINE USED 32-BIT INTEGERS
  • DEVELOPED C ROUTINE TO CALCULATE CORDIC
    • MOST CLOSELY MATCHED CORE NUMBER 2
    • 32-BIT INTEGER VARIABLES
    • 11 ITERATIONS
  • WROTE COMMON “MAIN” PROGRAM
    • CALLED CORDIC ROUTINE 256 TIMES
    • EACH CALL CALCULATED 249984 ARCTANGENTS

12

results
RESULTS

*33,792 SLICES AVAILABLE

13

circuit area comparison
CIRCUIT AREA COMPARISON
  • C CODE REQUIRED 2 – 5 TIMES MORE CIRCUIT AREA
  • WHERE DOES THE FACTOR OF 2 COME FROM?
    • TOTAL CIRCUIT AREA
    • 6710 SLICES COMPARED TO 3555 FOR MOST SIMILAR CORE
  • WHERE DOES THE FACTOR OF 5 COME FROM?
    • AREA OF CORDIC ROUTINE ITSELF
    • APPROXIMATELY 2800 SLICES ADDED TO IP CORE FOR INTERFACE
    • SUBTRACTING 2800 FROM 6710 SLICES SAYS THE C CORDIC REQUIRED ABOUT 3900 SLICES
    • 3900 COMPARED TO 777 IS A RATIO OF 5 TO 1

14

circuit area implications
CIRCUIT AREA IMPLICATIONS
  • CIRCUIT AREA EFFICIENCY LIMITS PROGRAM SIZE
  • EXTRA CIRCUIT AREA CAN BE USED FOR LOOP UNROLLING AND PARALLELIZATION TO INCREASE EXECUTION SPEED
  • MAY REQUIRE USE OF IP CORES OR CUSTOM CIRCUIT DESIGN
  • EXAMPLE: USED 8 COPIES OF CORE 1 IN PARALLEL.
    • RESULTING CIRCUIT REQUIRED ONLY 6814 SLICES
    • APPROXIMATELY SAME AS ONE CORDIC WRITTEN IN C
  • IF THE PROGRAM FITS, THEN THIS DOESN’T MATTER
  • FPGA DENSITY IS EXCEEDING MOORE’S LAW

15

execution speed comparison
EXECUTION SPEED COMPARISON
  • TOTAL EXECUTION TIME IS APPROXIMATELY IDENTICAL AT 7.4 SEC
  • LATENCY FOR C CODE IS ABOUT TWICE THAT OF IP CORES
  • LATENCY IS INSIGNIFICANT WITH LARGE DATA BLOCKS

Time per loop = (Latency – 1 + Number of Calculations) * 10 ns

(112 – 1 + 249984) * 10 ns = 2.5 msec/loop

  • CALCULATION TIME IS SMALL COMPARED TO EXECUTION TIME OF 7.4 SEC

256 loops * 2.5 msec = 0.64 seconds

  • MEMORY TRANSFER TIME DOMINATES EXECUTION TIME
    • WITH PEAK TRANSFER RATE MEMORY TRANSFERS REQUIRE 4 SECONDS

16

computational throughput summary
COMPUTATIONAL THROUGHPUT SUMMARY
  • COMPUTATIONAL THROUGHPUT FOR C ROUTINES EQUALS IP CORES FOR MANY PROBLEMS
    • WHEN HLL ROUTINE CAN BE FULLY PIPELINED AND
    • NUMBER OF CALCULATIONS >> LATENCY OR
    • I/O TIME >> CALCULATION TIME
  • OFTEN NO SIGNIFICANT TIMING IMPACT FOR USE OF C

17

advantages of high level languages
ADVANTAGES OF HIGH LEVEL LANGUAGES
  • DOES NOT REQUIRE AN ADDITIONAL SKILL SET
    • NO HARDWARE KNOWLEDGE REQUIRED
  • INCREASED DESIGN FLEXIBILITY
    • PROGRAMMER IMPLEMENTS ALGORITHM AS DESIRED
    • LIMITED TO DATA WIDTHS RECOGNIZED BY THE COMPILER
  • LOWEST COST METHOD FOR APPLICATIONS WHERE IP CORES DO NOT EXIST
  • AVOIDS PURCHASE OR ROYALTY COST OF IP CORE WHEN ONE IS AVAILABLE

18

advantages of ip cores
ADVANTAGES OF IP CORES
  • EASE OF DEVELOPMENT
    • VERY LOW DEVELOPMENT EFFORT
    • PREVIOUSLY VALIDATED (HOPEFULLY)
    • EASILY RECONFIGURED
  • LOWER LATENCY
    • OFTEN ONLY A MINOR ADVANTAGE
    • IN SOME CASES IT MAY BE CRITICAL
  • MORE COMPACT CIRCUIT REALIZATION
    • 2 TO 5 TIMES LESS FPGA CIRCUITRY NEEDED
    • ALLOWS SOLUTION OF MORE COMPLEX PROBLEMS
    • ENABLES INCREASED PARALLELISM OR LOOP UNROLLING

19

conclusions
CONCLUSIONS
  • SRC-6E COMPILER ALLOWS C AND FORTRAN PROGRAMMERS TO ACCELERATE PROGRAMS WITHOUT BEING CIRCUIT DESIGNERS
  • HARDWARE COMPILER PRODUCES CODE THAT COMPARES WELL TO CUSTOM IP CORES
    • TYPICALLY SAME EXECUTION SPEED
    • REQUIRES MORE FPGA CIRCUIT AREA
  • PROVIDES CAPABILITY TO INTEGRATE IP CORES OR CUSTOM CIRCUIT DESIGNS

20