performance comparison of cordic implementations on the src 6e reconfigurable computer l.
Download
Skip this Video
Loading SlideShow in 5 Seconds..
Performance Comparison of CORDIC Implementations on the SRC-6E Reconfigurable Computer PowerPoint Presentation
Download Presentation
Performance Comparison of CORDIC Implementations on the SRC-6E Reconfigurable Computer

Loading in 2 Seconds...

play fullscreen
1 / 20

Performance Comparison of CORDIC Implementations on the SRC-6E Reconfigurable Computer - PowerPoint PPT Presentation


  • 176 Views
  • Uploaded on

Performance Comparison of CORDIC Implementations on the SRC-6E Reconfigurable Computer. Russ Duren Baylor University, Waco, Texas Douglas Fouts and Dan Zulaica Naval Postgraduate School, Monterey, California . SRC-6E RECONFIGURABLE COMPUTER.

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about 'Performance Comparison of CORDIC Implementations on the SRC-6E Reconfigurable Computer' - lycoris


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
performance comparison of cordic implementations on the src 6e reconfigurable computer

Performance Comparison of CORDIC Implementations on the SRC-6E Reconfigurable Computer

Russ Duren

Baylor University, Waco, Texas

Douglas Fouts and Dan Zulaica

Naval Postgraduate School, Monterey, California

1

src 6e reconfigurable computer
SRC-6E RECONFIGURABLE COMPUTER
  • HARDWARE ARCHITECTURE CONSISTS OF 2 IDENTICAL HALVES
  • EACH SIDE HAS
    • 2 INTEL PENTIUM® PROCESSORS
    • 2 VIRTEX II SIX MILLION GATE FPGAS
    • MEMORY-TO-MEMORY CONNECTION
  • PROGRAMS CAN BE WRITTEN IN C OR FORTRAN
  • PORTIONS OF THE PROGRAM ARE CONVERTED TO CIRCUITRY IN THE FPGAS
    • GREATLY INCREASES EXECUTION SPEED

2

src 6e hardware architecture 1 2
SRC-6E HARDWARE ARCHITECTURE (1/2)

μP Board

MAP

Intel® μP

Intel® μP

315/195 MB/s (peak)

Controller

L2

L2

6x 800 MB/s

MIOC

On-Board Memory (24 MB)

6x 800 MB/s

SNAP

PCI

Common

Memory

Chain Port 800 MB/s

Chain Port 800 MB/s

FPGA

FPGA

3

src 6e programming environment
SRC-6E PROGRAMMING ENVIRONMENT
  • PENTIUM PROCESSORS
    • LINUX OPERATING SYSTEM IS THE MAIN USER INTERFACE
    • C AND FORTRAN COMPILERS FOR CODE DEVELOPMENT
    • INTEL AND GNU COMPILERS SUPPORTED
  • FPGAS
    • SRC-6E CUSTOM COMPILER CONVERTS HLL TO FPGA CIRCUITRY
    • FPGA ROUTINES WRITTEN IN C OR FORTRAN
  • SOFTWARE GENERATES ONE EXECUTABLE RESULT

4

hardware compiler
HARDWARE COMPILER
  • SRC COMPILER TRANSLATES THE C SOURCE INTO FPGA CIRCUITRY
  • CIRCUITRY IS DEEPLY PIPELINED
  • FPGAS ARE CLOCKED AT 100 MHz
  • ONCE THE PIPELINE IS FILLED (LATENCY) RESULTS ARE PRODUCED EVERY 10 NANOSECONDS

5

lower level programming
LOWER LEVEL “PROGRAMMING”
  • SUPPORTS DEVELOPMENT OF FPGA “ROUTINES” USING TRADITIONAL TOOLS:
    • VERILOG AND VHDL
    • SCHEMATIC CAPTURE
    • INTELLECTUAL PROPERTY (IP) CORES
  • THESE METHODS ARE ANALOGOUS TO PROGRAMMING IN ASSEMBLY LANGUAGE FOR A NORMAL PROCESSOR
  • IP CORES ARE ANALOGOUS TO A LIBRARY OF ASSEMBLY LANGUAGE ROUTINES

6

evaluating the src c hardware compiler
EVALUATING THE SRC C HARDWARE COMPILER
  • INTENDED TO ENABLE C PROGRAMMERS TO TAKE ADVANTAGE OF THE SRC-6E WITHOUT REQUIRING A KNOWLEDGE OF DIGITAL CIRCUIT DESIGN
  • HOW DO YOU EVALUATE THE COMPILER?
  • EASE OF USE – OBJECTIVE MEASURES?
    • SCOPE OF MODIFICATIONS REQUIRED TO USE THE FPGAS
    • LIMITATIONS IMPOSED ON HLL ROUTINES
  • TYPICAL MEASURES OF COMPILER PERFORMANCE
    • OBJECT CODE SIZE - TRANSLATES TO FPGA CIRCUIT AREA
    • CODE EXECUTION SPEED

7

modifications required to use the fpgas
MODIFICATIONS REQUIRED TO USE THE FPGAS
  • CODE FOR FPGAS IS ISOLATED IN TO AN EXTERNAL FUNCTION
  • CALLS ARE INSERTED IN THE MAIN PROGRAM TO
    • INITIALIZE THE SRC
    • CALL THE EXTERNAL FUNCTION
    • RELEASE THE SRC (OPTIONAL)
  • EXTERNAL FUNCTION INCLUDES CALLS TO
    • TRANSFER INPUT DATA FROM SYSTEM COMMON MEMORY TO ON-BOARD MEMORY
    • TRANSFER OUTPUT DATA FROM ON-BOARD MEMORY TO SYSTEM COMMON MEMORY

8

limitations imposed on hll routines
LIMITATIONS IMPOSED ON HLL ROUTINES
  • VERSION 1.3 SUPPORTS:
    • DATA TYPES: 32-BIT (INT) AND 64-BIT (LONG LONG)
    • ADD, SUBTRACT, MULTIPLY
    • DIVISION (32-BIT ONLY)
    • RELATIONAL OPERATORS (==, !=, <, >, <=, >=)
    • BITWISE OPERATORS AND SHIFTS (&, |, !, <<, >>)
    • LOGICAL OPERATORS (&&, !, ||)
    • SQRT() (32BIT ONLY)
  • SOME RESTRICTIONS ON VARIABLE ACCESS AND CONTROL STATEMENTS
    • NOT PARTICULARLY LIMITING, BUT MUST BE CONSIDERED
    • EXAMPLE: NUMBER OF DATA WORDS TRANSFERRED MUST BE MULTIPLE OF 4
  • FUTURE VERSIONS OF THE COMPILER WILL SUPPORT MORE OPERATIONS AND DATA TYPES AND RELAX RESTRICTIONS

9

testing hardware compiler performance
TESTING HARDWARE COMPILER PERFORMANCE
  • WHAT BENCHMARK DO YOU USE FOR CIRCUIT AREA AND EXECUTION SPEED?
  • TYPICAL COMPILER EVALUATIONS COMPARE COMPILED CODE TO HAND-CODED ASSEMBLY LANGUAGE ROUTINES.
  • COMPARED COMPILER OUTPUT TO RESULTS OBTAINED USING INTELLECTUAL PROPERTY CORES
  • SELECTED CORDIC ALGORITHM AS BASIS OF COMPARISON
    • USES SHIFTS AND ADDS TO CALCULATE ARCTANGENT OF TWO NUMBERS
    • WELL SUITED TO IMPLEMENTATION ON AN FPGA
    • IP CORE AVAILABLE AT NO COST
  • USED XILINX CORE GENERATOR PROGRAM TO DEVELOP CORES

10

testing compiler efficiency
TESTING COMPILER EFFICIENCY
  • CORES VARIED IN INTERNAL PRECISION AND LATENCY
  • ALTHOUGH INPUTS TO CORES WERE 8 BITS WIDE, THE INTERFACE TO THE CALLING ROUTINE USED 32-BIT INTEGERS
  • DEVELOPED C ROUTINE TO CALCULATE CORDIC
    • MOST CLOSELY MATCHED CORE NUMBER 2
    • 32-BIT INTEGER VARIABLES
    • 11 ITERATIONS
  • WROTE COMMON “MAIN” PROGRAM
    • CALLED CORDIC ROUTINE 256 TIMES
    • EACH CALL CALCULATED 249984 ARCTANGENTS

12

results
RESULTS

*33,792 SLICES AVAILABLE

13

circuit area comparison
CIRCUIT AREA COMPARISON
  • C CODE REQUIRED 2 – 5 TIMES MORE CIRCUIT AREA
  • WHERE DOES THE FACTOR OF 2 COME FROM?
    • TOTAL CIRCUIT AREA
    • 6710 SLICES COMPARED TO 3555 FOR MOST SIMILAR CORE
  • WHERE DOES THE FACTOR OF 5 COME FROM?
    • AREA OF CORDIC ROUTINE ITSELF
    • APPROXIMATELY 2800 SLICES ADDED TO IP CORE FOR INTERFACE
    • SUBTRACTING 2800 FROM 6710 SLICES SAYS THE C CORDIC REQUIRED ABOUT 3900 SLICES
    • 3900 COMPARED TO 777 IS A RATIO OF 5 TO 1

14

circuit area implications
CIRCUIT AREA IMPLICATIONS
  • CIRCUIT AREA EFFICIENCY LIMITS PROGRAM SIZE
  • EXTRA CIRCUIT AREA CAN BE USED FOR LOOP UNROLLING AND PARALLELIZATION TO INCREASE EXECUTION SPEED
  • MAY REQUIRE USE OF IP CORES OR CUSTOM CIRCUIT DESIGN
  • EXAMPLE: USED 8 COPIES OF CORE 1 IN PARALLEL.
    • RESULTING CIRCUIT REQUIRED ONLY 6814 SLICES
    • APPROXIMATELY SAME AS ONE CORDIC WRITTEN IN C
  • IF THE PROGRAM FITS, THEN THIS DOESN’T MATTER
  • FPGA DENSITY IS EXCEEDING MOORE’S LAW

15

execution speed comparison
EXECUTION SPEED COMPARISON
  • TOTAL EXECUTION TIME IS APPROXIMATELY IDENTICAL AT 7.4 SEC
  • LATENCY FOR C CODE IS ABOUT TWICE THAT OF IP CORES
  • LATENCY IS INSIGNIFICANT WITH LARGE DATA BLOCKS

Time per loop = (Latency – 1 + Number of Calculations) * 10 ns

(112 – 1 + 249984) * 10 ns = 2.5 msec/loop

  • CALCULATION TIME IS SMALL COMPARED TO EXECUTION TIME OF 7.4 SEC

256 loops * 2.5 msec = 0.64 seconds

  • MEMORY TRANSFER TIME DOMINATES EXECUTION TIME
    • WITH PEAK TRANSFER RATE MEMORY TRANSFERS REQUIRE 4 SECONDS

16

computational throughput summary
COMPUTATIONAL THROUGHPUT SUMMARY
  • COMPUTATIONAL THROUGHPUT FOR C ROUTINES EQUALS IP CORES FOR MANY PROBLEMS
    • WHEN HLL ROUTINE CAN BE FULLY PIPELINED AND
    • NUMBER OF CALCULATIONS >> LATENCY OR
    • I/O TIME >> CALCULATION TIME
  • OFTEN NO SIGNIFICANT TIMING IMPACT FOR USE OF C

17

advantages of high level languages
ADVANTAGES OF HIGH LEVEL LANGUAGES
  • DOES NOT REQUIRE AN ADDITIONAL SKILL SET
    • NO HARDWARE KNOWLEDGE REQUIRED
  • INCREASED DESIGN FLEXIBILITY
    • PROGRAMMER IMPLEMENTS ALGORITHM AS DESIRED
    • LIMITED TO DATA WIDTHS RECOGNIZED BY THE COMPILER
  • LOWEST COST METHOD FOR APPLICATIONS WHERE IP CORES DO NOT EXIST
  • AVOIDS PURCHASE OR ROYALTY COST OF IP CORE WHEN ONE IS AVAILABLE

18

advantages of ip cores
ADVANTAGES OF IP CORES
  • EASE OF DEVELOPMENT
    • VERY LOW DEVELOPMENT EFFORT
    • PREVIOUSLY VALIDATED (HOPEFULLY)
    • EASILY RECONFIGURED
  • LOWER LATENCY
    • OFTEN ONLY A MINOR ADVANTAGE
    • IN SOME CASES IT MAY BE CRITICAL
  • MORE COMPACT CIRCUIT REALIZATION
    • 2 TO 5 TIMES LESS FPGA CIRCUITRY NEEDED
    • ALLOWS SOLUTION OF MORE COMPLEX PROBLEMS
    • ENABLES INCREASED PARALLELISM OR LOOP UNROLLING

19

conclusions
CONCLUSIONS
  • SRC-6E COMPILER ALLOWS C AND FORTRAN PROGRAMMERS TO ACCELERATE PROGRAMS WITHOUT BEING CIRCUIT DESIGNERS
  • HARDWARE COMPILER PRODUCES CODE THAT COMPARES WELL TO CUSTOM IP CORES
    • TYPICALLY SAME EXECUTION SPEED
    • REQUIRES MORE FPGA CIRCUIT AREA
  • PROVIDES CAPABILITY TO INTEGRATE IP CORES OR CUSTOM CIRCUIT DESIGNS

20