AUTOMATIC MAPPING OF KHOROS-BASED APPLICATIONS TO ADAPTIVE COMPUTING SYSTEMS

AUTOMATIC MAPPING OF KHOROS-BASED APPLICATIONS TO ADAPTIVE COMPUTING SYSTEMS. MAPLD-99 Laurel, MD September 29, 1999 Senthil Natarajan, Ben Levine, Chandra Tan, Danny Newport and Don Bouldin Electrical & Computer Engineering University of Tennessee Knoxville, TN 37996-2100

AUTOMATIC MAPPING OF KHOROS-BASED APPLICATIONS TO ADAPTIVE COMPUTING SYSTEMS

MAPLD-99

Laurel, MD

September 29, 1999

Senthil Natarajan, Ben Levine, Chandra Tan, Danny Newport and Don Bouldin

Electrical & Computer Engineering

University of Tennessee

Knoxville, TN 37996-2100

TEL: (423)-974-5444

FAX: (423)-974-8245

[email protected]

INITIAL DESIGN CAPTURE AND ALGORITHM VERIFICATION USING KHOROS

[Figure: Khoros workspace with an Input Image, an Output Image, and blocks labeled Low Pass Filter, Target Recon., Region Merge, Stat. Calc., and Mask Convol.]

Correctly Detected 3 Tanks But 1 False Target (Truck)
ADAPTIVE COMPUTING SYSTEMS CONSIST OF ACCELERATOR BOARDS OF FPGAS

Multiple FPGAs on a printed circuit board can be tightly coupled with a host CPU to accelerate low-level computations.

Our Wildforce Board has five Xilinx FPGAs (13K to 85K gates), each with 512K x 32-bit RAM.
CURRENT STATE-OF-THE-ART

[Figure: flow diagram with blocks labeled KHOROS and PARTITIONING ONTO MULTIPLE FPGAS, and two gaps labeled MISSING LINK.]

CHAMPION WILL IMPROVE PRODUCTIVITY

  • GOAL: Automate the mapping of Khoros-based applications onto adaptive computing systems to improve designer productivity by 100x.
  • IMPACT:
    • More application designers will be able to achieve higher quality implementations in less time.
    • Adaptive computing systems will be utilized more effectively and by a wider audience.

[Figure: Manual Mapping Onto An Adaptive Computing System. Timeline from KHOROS to ACS; axis: TIME (WEEKS).]

[Figure: Champion Will Improve Productivity By Using Estimation and Automatic Mapping of Precompiled Library Primitives. Timeline from KHOROS through ESTIMATION to ACS; axis: TIME (WEEKS).]

OUTLINE OF THIS PRESENTATION
  • Application Development Flow
  • Library Development and Verification
  • Manual Implementation
  • ATR Executions on ACS
  • Automated Partitioning Algorithms
  • Lessons Learned and Future Plans
APPLICATION DEVELOPMENT FLOW

[Figure: flow with stages labeled APPLICATION, KHOROS/CANTATA, DATA WIDTH MATCHING & SYNCHRONIZATION, PARTITIONING, SYNTHESIS & PLACEMENT/ROUTING, and ADAPTIVE COMPUTING SYSTEM. Also shown: Precompiled Libraries and Destination Hardware Architecture.]

ALGORITHM STRUCTURE

Find Targets and Label Image

[Figure: blocks labeled Input FLIR Image, Target Pixel Map, Frame Map, Identify Target Region, Insert Target Frame (REPEAT 6 TIMES), Merge Images, and Output Image.]

The target pixel map is then used to identify square regions that are considered to contain targets. These target regions are then masked off (it is assumed that there is only one target per region). The target region location is then used to draw a frame that will identify the target in the output image. This is repeated six times.
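
The loop described above can be sketched in Python. This is a minimal software illustration, not the authors' Khoros or hardware implementation. The function names, the `region` side length, and the choice to anchor each square region at the first target pixel found in raster order are all assumptions made for the example.

```python
def find_first_target_pixel(pixel_map):
    """Return (row, col) of the first nonzero pixel in raster order, or None."""
    for r, row in enumerate(pixel_map):
        for c, value in enumerate(row):
            if value:
                return r, c
    return None


def label_targets(pixel_map, region=3, passes=6):
    """Return (frames, frame_map) for up to `passes` detected targets.

    ASSUMPTIONS (not from the slides): the square region is anchored at the
    first target pixel found, and `region` is its side length in pixels.
    """
    rows, cols = len(pixel_map), len(pixel_map[0])
    work = [list(row) for row in pixel_map]        # leave the input untouched
    frame_map = [[0] * cols for _ in range(rows)]  # blank frame map
    frames = []
    for _ in range(passes):                        # "repeated six times"
        hit = find_first_target_pixel(work)
        if hit is None:
            break                                  # no target pixels left
        r0, c0 = hit
        r1, c1 = min(r0 + region, rows), min(c0 + region, cols)
        for r in range(r0, r1):
            for c in range(c0, c1):
                work[r][c] = 0                     # mask off the target region
                if r in (r0, r1 - 1) or c in (c0, c1 - 1):
                    frame_map[r][c] = 1            # mark the frame outline
        frames.append((r0, c0, r1, c1))
    return frames, frame_map
```

Each pass masks the region it has just framed, so the next pass finds a different target. The early exit when no target pixels remain is a software shortcut for this sketch.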

DEVELOP AND PRECOMPILE LIBRARY CELLS

[Figure: Test Inputs and Responses at four levels: KHOROS C (Floating Point), KHOROS C (Fixed Point), VHDL, and FPGA.]

Each Library Primitive Will Be Developed at Each Level, Verified, and Characterized.

KHOROS AND CHAMPION LIBRARY CELLS
  • Khoros Traditional Cells:
    • Some Khoros cells have multiple functions for the user to select.
    • A single cell can handle all input dimension sizes.
    • Cells can handle inputs of any data type.
    • Data passed between cells is stored on the host CPU system as temporary files.
    • Khoros handles data movement between cells. Each cell begins its execution only after all its inputs have been written onto the host file system.
  • Champion Cells:
    • Each hardware cell has only one specific function and one data type.
    • Hardware cells are parametrized to correspond to the desired data bit widths.
    • Data is transferred between hardware cells sequentially, one pixel per clock cycle.
    • Data arrival at hardware cells must be synchronized; Champion does this by inserting delay elements.
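
Because each hardware cell is fixed to one bit width, a narrower input has to be padded before it can feed a wider cell. The sketch below shows that bookkeeping for a two-input adder. The cell names imitate the `pad_8_9` and `add_9` style seen in the task graphs; the rule that an `add_N` cell takes N-bit inputs and yields an (N+1)-bit sum is an assumption read off the figures, and `match_adder` is a hypothetical helper, not part of Champion.

```python
def match_adder(width_a, width_b):
    """Return (cells, output_width) for adding two unsigned pixel streams.

    ASSUMPTION: add_N takes two N-bit inputs and produces an (N+1)-bit sum,
    and pad_A_B widens an A-bit stream to B bits.
    """
    width = max(width_a, width_b)
    cells = []
    for w in (width_a, width_b):
        if w < width:
            cells.append(f"pad_{w}_{width}")  # widen the narrower input
    cells.append(f"add_{width}")
    return cells, width + 1                   # one extra bit holds the carry
```

For example, `match_adder(8, 11)` returns `(['pad_8_11', 'add_11'], 12)`, while two 8-bit inputs need no padding and give `(['add_8'], 9)`.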


DATA BIT-WIDTHS MUST BE MATCHED

[Figure: dataflow graph from IN to OUT containing CONVSTREAM_8_256_256, ADD, RIGHT SHIFT 3, CLIP HIGH 255, four ADD_8 cells, two ADD_9 cells, and one ADD_10 cell; edges are annotated with bit widths from 8 to 11.]

DATA MUST BE SYNCHRONIZED DUE TO DIFFERENT PATH DELAYS

[Figure: dataflow graph from IN to OUT containing CONVSTREAM_8_256_256, PAD_HIGH_8_11, ADD_11, RIGHT_SHIFT_12_3, CLIP_HIGH_12_255, TRUNCATE_HIGH_12_8, four ADD_8 cells, two ADD_9 cells, and one ADD_10 cell. Cells are annotated with a latency L (L = 257 for CONVSTREAM_8_256_256, L = 0 for PAD_HIGH_8_11, L = 1 for each ADD_8, ADD_9, and ADD_10 cell) and edges with a time T, from T = 0 at IN to T = 260; edge bit widths range from 8 to 12.]

Data synchronization error! Input times are not equal.
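
The synchronization step can be illustrated with a small arrival-time calculation. In this sketch each cell's output time is the latest input arrival plus the cell's latency, and any input that arrives earlier is given a delay equal to the difference. The function name, the data layout, and the simplified four-cell graph are assumptions for illustration; this is not Champion's actual algorithm.

```python
def insert_delays(latency, inputs):
    """Compute output times and the delay needed on each early input edge.

    latency: dict mapping cell name -> latency in clock cycles
    inputs:  dict mapping cell name -> list of predecessor cell names
    Returns (arrival, delays), where delays[(src, dst)] is the number of
    delay cycles to insert on the edge src -> dst.
    """
    arrival, delays = {}, {}

    def output_time(cell):
        if cell not in arrival:
            preds = inputs.get(cell, [])
            times = [output_time(p) for p in preds]
            ready = max(times, default=0)      # latest input sets the pace
            for p, t in zip(preds, times):
                if t < ready:
                    delays[(p, cell)] = ready - t
            arrival[cell] = ready + latency[cell]
        return arrival[cell]

    for cell in latency:
        output_time(cell)
    return arrival, delays


# Hypothetical graph: the three adder stages are collapsed into one cell.
latency = {"in": 0, "conv": 257, "pad": 0, "add_tree": 3, "add_11": 1}
inputs = {"conv": ["in"], "pad": ["conv"], "add_tree": ["conv"],
          "add_11": ["pad", "add_tree"]}
arrival, delays = insert_delays(latency, inputs)
```

With these example latencies the padded stream reaches `add_11` at time 257 and the adder-tree stream at time 260, so the sketch reports a 3-cycle delay on the `pad` edge.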

ORIGINAL KHOROS TASK GRAPH

[Figure: task graph of library cells, each annotated with S and D values, with edges annotated with bit widths from 1 to 11. Cells: RAM_Read_pf4_var_8 (S:32, D:-), Sobel_8_8_256_256 (S:404, D:262), Lowpass_8_8_256_256 (S:346, D:262), START_Mean_SD (S:354, D:S+14; two instances), shift_left_9_1 (S:0, D:0), shift_left_10_2 (S:0, D:0), add_9 (S:12, D:1), add_10 (S:13, D:1), and_1 (S:4, D:1), Lowpass_1_4_256_256 (S:168, D:262; two instances), gte_4_4 (S:9, D:1; two instances), gte_8 (S:11, D:1; two instances), and MITR (S:63, D:5). Some cells carry the markers R, M, or A.]

HARDWARE TASK GRAPH WITH DATA BIT-WIDTH MATCHED AND SYNCHRONIZED

[Figure: the original task graph with additional cells inserted: RAM_buffer_pf4_8 (S:56, D:16+S; two instances), pad_8_9 (S:0, D:0; two instances), pad_8_10 (S:0, D:0; two instances), clip_high_10_8 (S:11, D:1), clip_high_11_8 (S:11, D:1), trunc_high_10_8 (S:0, D:0), and trunc_high_11_8 (S:0, D:0). The cells of the original graph keep their S and D annotations.]

OUR WILDFORCE ACS USED AS A LINEAR ARRAY

[Figure: five processing elements, PE0 to PE4, each a Xilinx FPGA with its own Local RAM (four 4013XL Series devices and one 4036XL Series device). Other labels: PCI Interface, Local Bus, 32, 36-bit Data Path, and Crossbar.]

PARTITION EARLY INSTEAD OF LATE TO SHORTEN THE HARDWARE MAPPING TIME

EARLY

[Figure: flow with blocks labeled Design Input in Khoros, Workspace to Netlist, K-way partitioning + Global Place & Route, partitions P1 to Pk (each followed by Merge with Precompiled Library Cells and Place & Route), SUCCESS, and Hardware Configuration.]

  • Coarser granularity -> smaller netlist.
  • Hierarchical and functional flow information is preserved.
  • Timing synchronization is greatly facilitated.
  • Less resource utilization.

LATE

[Figure: flow with blocks labeled Design Input in Khoros, VHDL, Synthesis, Technology Mapping, Optimizer, Flatten, K-way partitioning + Global Place & Route, partitions P1 to Pk (each followed by Place & Route), SUCCESS, and Hardware Configuration.]

  • Finer granularity -> larger netlist.
  • Functional and algorithmic flow of the design is lost.
  • Timing synchronization can be a problem.
  • More resource utilization.
  • The resulting subcircuits are more likely to be placeable and routable.

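MULTI-FPGA PARTITIONING

[Figure: NETLIST N is partitioned step by step. Panels show P0 with the remainder N - P0; P0 and P1 with the remainder N - P0 - P1; P0, P1, and P2 with the remainder N - P0 - P1 - P2; and a final panel with partitions P0 to P4.]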
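
One simple way to picture peeling partitions off a netlist of precompiled cells is a first-fit pass over a linear array of processing elements. The sketch below is a deliberately naive illustration and is not the partitioning algorithm used in Champion. The cell sizes, the capacities, and the rule of starting a new board configuration once every PE is full are hypothetical.

```python
def partition_linear(cells, capacities):
    """Assign cells, in dataflow order, to PEs of a linear array.

    cells:      list of (name, size) tuples in pipeline order
    capacities: list with the capacity of each PE, in the same size units
    Returns a list of board configurations; each configuration is a list
    with one list of cell names per PE.
    """
    largest = max(capacities)
    configs = [[[] for _ in capacities]]
    pe, used = 0, 0
    for name, size in cells:
        if size > largest:
            raise ValueError(f"cell {name} does not fit on any PE")
        while used + size > capacities[pe]:
            pe, used = pe + 1, 0               # move on to the next PE
            if pe == len(capacities):          # array full: reconfigure board
                configs.append([[] for _ in capacities])
                pe = 0
        configs[-1][pe].append(name)
        used += size
    return configs


# Hypothetical example: five cells mapped onto two PEs of capacity 10.
example = partition_linear(
    [("a", 6), ("b", 5), ("c", 4), ("d", 8), ("e", 3)], [10, 10])
```

Because cells are placed in pipeline order, neighbouring cells tend to land on the same or adjacent PEs, which suits a board used as a linear array. Every extra configuration in the result corresponds to a board reconfiguration.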
TIMING RESULTS FOR ATR ON OUR WILDFORCE
  • OUR WILDFORCE ACS IS 156X FASTER THAN KHOROS/CPU NOW.
  • IF WE HAVE SUFFICIENT LOGIC AND MEMORY SUCH THAT NO RECONFIGURATIONS ARE NEEDED, THE ACS COULD BE 667X FASTER.
  • IF FULLY PIPELINED, THE ACS COULD BE 32,000X FASTER.

[Bar chart: time by component, axis from 0 to 6000]

Data Processing      33
Data Transfer        34
Host Code          1544
Reconfiguration    5159
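
The speedup figures on this slide can be cross-checked against the time breakdown in the chart. The sketch below assumes that "no reconfigurations" means dropping the Reconfiguration component and that "fully pipelined" leaves only the Data Processing component. Both readings are assumptions, and the chart's time unit is not stated in the transcript.

```python
# Time per component, as read from the chart (unit not stated).
breakdown = {
    "Data Processing": 33,
    "Data Transfer": 34,
    "Host Code": 1544,
    "Reconfiguration": 5159,
}

total = sum(breakdown.values())
share = {name: value / total for name, value in breakdown.items()}

# ASSUMPTION: without reconfigurations, only that component disappears.
gain_no_reconfig = total / (total - breakdown["Reconfiguration"])

# ASSUMPTION: fully pipelined, only the data-processing time remains.
gain_pipelined = total / breakdown["Data Processing"]

for name, fraction in share.items():
    print(f"{name:<16} {fraction:6.1%}")
print(f"gain without reconfiguration: {gain_no_reconfig:.1f}x")
print(f"gain if fully pipelined:      {gain_pipelined:.1f}x")
```

Under these assumptions reconfiguration accounts for about 76% of the total, and the implied gains over the current ACS run are about 4.2x and 205x. These are close to the ratios of the quoted speedups (667/156 is about 4.3 and 32,000/156 is about 205).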

PARTITIONING - 1st BOARD CONFIGURATION PHASE

[Figure: assignment of operations to CPE0 and PE1 to PE4, with RAM blocks. Operations shown: Input Image, Blank Frame Map, Sobel Filter, Low-Pass Filter (three instances), Compute Edge Stats, Check Edge Stats, Compute Intensity Stats, Check Intensity Stats, AND, Check >= 4 (two instances), Mask Invalid Target Region, Find First Target Pixel, Mask Target Pixels, Mark Frame Pixels, and Write to RAM - A. Numeric annotations: 4, 11, 72, 500, 548, 554, and 1296.]

PARTITIONING - 2nd BOARD CONFIGURATION PHASE

[Figure: assignment of operations to CPE0 and PE1 to PE4, with RAM blocks. Operations shown: Read from RAM - A, Find First Target Pixel, Mask Target Pixels, and Mark Frame Pixels (three instances each), and Write to RAM - B. Numeric annotations: 4, 5, 53, 72, and 500.]

PARTITIONING - 3rd BOARD CONFIGURATION PHASE

[Figure: assignment of operations to CPE0 and PE1 to PE4, with RAM blocks. Operations shown: Read from RAM - B, Find First Target Pixel, Mask Target Pixels, and Mark Frame Pixels (two instances each), and Write to RAM - C. Numeric annotations: 0, 4, 5, 53, 72, and 500.]

PARTITIONING - 4th BOARD CONFIGURATION PHASE

[Figure: assignment of operations to CPE0 and PE1 to PE4, with RAM blocks. Operations shown: Read from RAM - C, Input Image, Find Max Intensity, Combine Image and Frames, and Output Image. Numeric annotations: 4, 11, 53, 72, 75, 90, and 119.]

PRODUCTIVITY IMPROVEMENT IS 100X (250 hours manually vs. 2.5 hours automatically)

[Figure: timeline comparing the Manual and Automatic flows from Application to ACS. Stages shown: Khoros, Partitioning Suite, Data Matching, Data Synchronization, WSP2NETLIST, NETLIST2STV, and Synthesis/Place & Route; axis: time.]

LESSONS LEARNED
  • Learned that the translation from KHOROS to hardware is complicated by several factors including:
    • Differences in the way blocks of data are passed from operator to operator.
    • Parameters for data bit-widths must be specified for each cell.
    • Difference between data-driven KHOROS cells and clock-driven hardware cells creates a need for data synchronization.
  • Determined that reconfiguration time was the major obstacle to achieving high performance, and that RAM access conflicts required more reconfigurations than would be otherwise necessary.
  • Learned that manual implementation of KHOROS applications on WildForce is very time-consuming and tedious (250 hours).
  • Thus, great potential exists for making a significant (100x) improvement in productivity via automation.
SCHEDULE AND MILESTONES

May 98 Demonstrated the manual mapping of a simple KHOROS network on a Xilinx-based ACS (EVC-1). We also validated our method for library development at the KHOROS, VHDL and FPGA levels.

Sep 98 Demonstrated the manual mapping of a more complex KHOROS network on a Xilinx-based ACS (Wildforce).

Mar 99 Demonstrated the manual mapping of a complex KHOROS network with some automated FPGA partitioning on the Wildforce.

Sep 99 Automated additional portions of the application development flow.

Jan 00 Will demonstrate the Army Night Vision Lab challenge problem with automatic mapping onto the Wildforce.

Mar 00 Will demonstrate two additional challenge problems (e.g. Face Detection and Image Backprojection) on the Wildforce.

Sep 00 Will demonstrate all three challenge problems on two additional ACS platforms (e.g. Altera-based ACS and latest Xilinx-Virtex ACS).

CHAMPION: A SOFTWARE DESIGN ENVIRONMENT FOR ADAPTIVE COMPUTING SYSTEMS

KEY IDEAS

  • Khoros-based designs will be automatically linked to Adaptive Computing Systems.
  • Mapping onto ACS will be accelerated using precompiled library primitives (semicustom approach).
  • Metrics and visualization will reveal low-level details to the application designer as warranted.
  • Application designer will guide the partitioning of tasks between hardware and software.

SCHEDULE AND MILESTONES

May 98 Demonstrated the manual mapping of a simple KHOROS network on a Xilinx-based ACS (EVC-1). We also validated our method for library development at the KHOROS, VHDL and FPGA levels.

Sep 98 Demonstrated the manual mapping of a more complex KHOROS network on a Xilinx-based ACS (Annapolis Microsystems Wildforce).

Mar 99 Demonstrated the manual mapping of a complex KHOROS network with some automated FPGA partitioning on the Wildforce.

Sep 99 Automated additional portions of the application development flow.

Jan 00 Will demonstrate the Army Night Vision Lab challenge problem with automatic mapping onto the Wildforce.

Mar 00 Will demonstrate two additional challenge problems (e.g. Face Detection and Image Backprojection) on the Wildforce.

Sep 00 Will demonstrate all three challenge problems on two additional ACS platforms (e.g. Altera-based ACS and latest Xilinx-Virtex ACS).

GOAL: Automate the mapping of Khoros-based applications onto adaptive computing systems to improve designer productivity by 100x.

IMPACT: More application designers will be able to achieve higher quality implementations in less time. Adaptive computing systems will be utilized more effectively and by a wider audience.