Automatic mapping of Khoros-based applications to adaptive computing systems
MAPLD-99, Laurel, MD, September 29, 1999

Senthil Natarajan, Ben Levine, Chandra Tan, Danny Newport and Don Bouldin

Electrical & Computer Engineering, University of Tennessee, Knoxville, TN 37996-2100

TEL: (423)-974-5444

FAX: (423)-974-8245


Initial design capture and algorithm verification using Khoros

[Figure: Khoros workspace that turns an input image into an output image. Blocks: Low Pass Filter, Target Recon., Region Merge, Stat. Calc., Mask, Convol. Result: correctly detected 3 tanks, but 1 false target (a truck).]


Khoros/Cantata is a visual programming language for prototyping algorithms


Adaptive computing systems consist of accelerator boards of FPGAs

Multiple FPGAs on a printed circuit board can be tightly coupled with a host CPU to accelerate low-level computations.

Our Wildforce board has five Xilinx FPGAs (13K to 85K gates), each with 512K x 32-bit RAM.


Current state-of-the-art

[Figure: a design flow that starts in Khoros and includes partitioning onto multiple FPGAs, with two steps marked MISSING LINK.]


Champion will automatically map Khoros designs onto adaptive computing systems


Champion will improve productivity

  • GOAL: Automate the mapping of Khoros-based applications onto adaptive computing systems to improve designer productivity by 100x.

  • IMPACT:

    • More application designers will be able to achieve higher quality implementations in less time.

    • Adaptive computing systems will be utilized more effectively and by a wider audience.

[Figure: two timelines, in weeks, from Khoros to ACS. The first shows manual mapping onto an adaptive computing system. The second adds an estimation step and shows that Champion will improve productivity by using estimation and automatic mapping of precompiled library primitives.]


Outline of this presentation

  • Application Development Flow

  • Library Development and Verification

  • Manual Implementation

  • ATR Executions on ACS

  • Automated Partitioning Algorithms

  • Lessons Learned and Future Plans


Application development flow

[Figure: flow diagram. Blocks shown: Application, Khoros/Cantata, Data Width Matching & Synchronization, Partitioning, Synthesis & Placement/Routing, Adaptive Computing System. Side inputs: Precompiled Libraries and Destination Hardware Architecture.]


Khoros/Cantata implementation: top level

Khoros/Cantata implementation: find targets

Khoros/Cantata implementation: mark frame pixels


Algorithm structure

[Figure: Find Targets and Label Image. Labels shown: Input FLIR Image, Target Pixel Map, Frame Map, Identify Target Region, Insert Target Frame (REPEAT 6 TIMES), Merge Images, Output Image.]

The target pixel map is used to identify square regions that are considered to contain targets. These target regions are then masked off (it is assumed that there is only one target per region). The location of each target region is then used to draw a frame that identifies the target in the output image. This is repeated six times.
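The find, mask and frame loop described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Khoros or Champion implementation: the function names, the raster-order search, the choice of the first hit as the top-left corner of the square, and the region size are all hypothetical.

```python
def find_first_target_pixel(pixel_map):
    """Return (row, col) of the first set pixel in raster order, or None."""
    for r, row in enumerate(pixel_map):
        for c, value in enumerate(row):
            if value:
                return r, c
    return None


def mask_region(pixel_map, top, left, size):
    """Clear a size x size square so the same target is not found again."""
    for r in range(top, min(top + size, len(pixel_map))):
        for c in range(left, min(left + size, len(pixel_map[0]))):
            pixel_map[r][c] = 0


def mark_frame(frame_map, top, left, size):
    """Set the border pixels of a size x size square in the frame map."""
    bottom = min(top + size, len(frame_map)) - 1
    right = min(left + size, len(frame_map[0])) - 1
    for c in range(left, right + 1):
        frame_map[top][c] = 1
        frame_map[bottom][c] = 1
    for r in range(top, bottom + 1):
        frame_map[r][left] = 1
        frame_map[r][right] = 1


def label_targets(pixel_map, region_size=4, passes=6):
    """Repeat find / mask / frame up to `passes` times (six on the slide)."""
    work = [row[:] for row in pixel_map]          # keep the caller's map intact
    frame_map = [[0] * len(work[0]) for _ in work]
    regions = []
    for _ in range(passes):
        hit = find_first_target_pixel(work)
        if hit is None:                           # fewer targets than passes
            break
        top, left = hit
        regions.append((top, left))
        mask_region(work, top, left, region_size)
        mark_frame(frame_map, top, left, region_size)
    return regions, frame_map
```

Calling `label_targets(pixel_map)` returns the top-left corners of the regions found and a frame map that could then be merged with the input image.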


Develop and precompile library cells

Each library primitive will be developed at each level, verified, and characterized.

[Figure: test inputs and responses at four levels: Khoros C (floating point), Khoros C (fixed point), VHDL, and FPGA.]


Khoros and Champion library cells

  • Khoros Traditional Cells:

    • Some Khoros cells have multiple functions for the user to select.

    • A single cell can handle all input dimension sizes.

    • Cells can handle inputs of any data type.

    • Data between cells is stored on the host CPU system as temp files.

    • Khoros handles data movement between cells. Each cell begins its execution only after all its inputs have been written onto the host file system.

  • Champion Cells:

    • Each hardware cell has only one specific function and one data type.

    • Hardware cells are parametrized to correspond to the desired data bit widths.

    • Data is transferred between hardware cells sequentially, one pixel per clock cycle.

    • Data arrival at hardware cells must be synchronized; Champion does this by inserting delay elements.


Data bit-widths must be matched

[Figure: filter network from IN to OUT. Cells shown: CONVSTREAM_8_256_256, four ADD_8, two ADD_9, one ADD_10, ADD, RIGHT SHIFT 3, CLIP HIGH 255. Edges are labelled with bit widths from 8 to 11.]
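The width bookkeeping behind this slide can be illustrated with a short Python sketch. It rests on assumptions that the transcript does not confirm: values are unsigned, a cell named ADD_n takes n-bit operands and produces n+1 bits, and the four ADD_8, two ADD_9 and one ADD_10 cells form an eight-input adder tree. The helper names are hypothetical.

```python
def add_out_width(in_bits):
    """Bits needed for the sum of two unsigned in_bits-wide values."""
    return in_bits + 1


def adder_tree_out_width(in_bits, num_inputs):
    """Output width of a balanced adder tree over equal-width inputs."""
    width, remaining = in_bits, num_inputs
    while remaining > 1:
        width = add_out_width(width)
        remaining = (remaining + 1) // 2   # each level halves the operand count
    return width


def pad_bits_needed(src_bits, dst_bits):
    """Zero bits to add on the high side so both adder inputs match."""
    if src_bits > dst_bits:
        raise ValueError("cannot pad to a narrower width")
    return dst_bits - src_bits


def clip_high(value, limit):
    """Saturate a value at an upper limit."""
    return min(value, limit)


tree_bits = adder_tree_out_width(8, 8)     # 8 -> 9 -> 10 -> 11 bits
pad = pad_bits_needed(8, tree_bits)        # an 8-bit stream needs 3 pad bits
sum_bits = add_out_width(tree_bits)        # final adder output: 12 bits

# Worst case with all-255 pixels still fits in 12 bits before the shift.
worst = (8 * 255 + 255) >> 3
result = clip_high(worst, 255)             # saturates back into 8 bits
```

The point of the sketch is that every adder input pair must have equal width, so a narrower stream has to be padded before it can join a wider one, and the result has to be shifted and clipped before it returns to 8 bits.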


Data must be synchronized due to different path delays

[Figure: filter network from IN to OUT, annotated with cell latencies (L) and data arrival times (T). Cells shown: CONVSTREAM_8_256_256 (L = 257), PAD_HIGH_8_11 (L = 0), ADD_8, ADD_9 and ADD_10 adders (L = 1 each), ADD_11, RIGHT_SHIFT_12_3, CLIP_HIGH_12_255, TRUNCATE_HIGH_12_8. IN is at T=0; arrival times of T=257, 258, 259 and 260 are marked along the paths. Edge widths range from 8 to 12 bits. The figure flags: "Data synchronization error! Input times are not equal."]
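The synchronization problem on this slide amounts to making all inputs of a cell arrive in the same clock cycle. A minimal Python sketch of that idea follows. It is not Champion's algorithm: the function name, the dictionary layout, and the simplified chain of cells are assumptions, and only the latency values (257, 0 and 1) are taken from the slide.

```python
def synchronize(preds, latency):
    """Compute output times and the delays needed on late-joining edges.

    preds maps each cell to the list of cells feeding it and must be
    listed in topological order (sources first).
    """
    ready = {}    # clock cycle at which each cell's output is valid
    delays = {}   # (producer, consumer) -> delay elements to insert
    for cell, inputs in preds.items():
        arrive = max((ready[p] for p in inputs), default=0)
        for p in inputs:
            if ready[p] < arrive:
                delays[(p, cell)] = arrive - ready[p]
        ready[cell] = arrive + latency[cell]
    return ready, delays


# Hypothetical, simplified connectivity: one padded path and one adder chain.
preds = {
    "IN": [],
    "CONVSTREAM": ["IN"],
    "PAD_HIGH": ["CONVSTREAM"],
    "ADD_8": ["CONVSTREAM"],
    "ADD_9": ["ADD_8"],
    "ADD_10": ["ADD_9"],
    "ADD_11": ["PAD_HIGH", "ADD_10"],
}
latency = {
    "IN": 0, "CONVSTREAM": 257, "PAD_HIGH": 0,
    "ADD_8": 1, "ADD_9": 1, "ADD_10": 1, "ADD_11": 1,
}

ready, delays = synchronize(preds, latency)
```

In this toy graph the padded path reaches the final adder at cycle 257 while the adder chain arrives at cycle 260, so three delay elements would be needed on the padded edge.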


Original Khoros task graph

[Figure: task graph of the Khoros design. Nodes: RAM_Read_pf4_var_8, Sobel_8_8_256_256, Lowpass_8_8_256_256, START_Mean_SD (x2), shift_left_9_1, shift_left_10_2, add_9, add_10, and_1, Lowpass_1_4_256_256 (x2), gte_8 (x2), gte_4_4 (x2), MITR. Each node carries S and D annotations (for example S:404 and D:262 on Sobel_8_8_256_256), and numeric edge labels range from 1 to 11.]


Hardware task graph with data bit-width matched and synchronized

[Figure: the task graph of the previous slide with additional cells inserted: RAM_buffer_pf4_8 (x2), pad_8_9 (x2), pad_8_10 (x2), clip_high_10_8, clip_high_11_8, trunc_high_10_8 and trunc_high_11_8. The original nodes (RAM_Read_pf4_var_8, Sobel_8_8_256_256, Lowpass_8_8_256_256, START_Mean_SD, shift_left_9_1, shift_left_10_2, add_9, add_10, and_1, Lowpass_1_4_256_256, gte_8, gte_4_4, MITR) keep their S and D annotations.]


Our Wildforce ACS used as a linear array

[Figure: board diagram with five processing elements, PE0 to PE4. One is a Xilinx 4036XL-series FPGA and four are Xilinx 4013XL-series FPGAs, each with its own Local RAM. Other labels: PCI Interface, Local Bus (32), Crossbar, and a legend marking the 36-bit data path.]


Partition early instead of late to shorten the hardware mapping time

EARLY

[Figure: blocks shown: Design Input in Khoros, Workspace to Netlist, K-way partitioning + Global Place & Route, then for each partition P1, P2, P3 ... Pk a Merge step with Precompiled Library Cells followed by Place & Route, ending in Hardware Configuration (SUCCESS).]

  • Coarser granularity -> smaller netlist.

  • Hierarchical and functional flow information is preserved.

  • Timing synchronization is greatly facilitated.

  • Less resource utilization.

LATE

[Figure: blocks shown: Design Input in Khoros, VHDL, Synthesis, Technology Mapping, Optimizer, Flatten, K-way partitioning + Global Place & Route, then Place & Route for each partition P1, P2, P3 ... Pk, ending in Hardware Configuration (SUCCESS).]

  • Finer granularity -> larger netlist.

  • The functional and algorithmic flow of the design is lost.

  • Timing synchronization can be a problem.

  • More resource utilization.

  • The resulting subcircuits are more likely to be placeable and routable.


Multi-FPGA partitioning

[Figure: netlist N is split step by step. First P0 is separated, leaving N - P0; then P1, leaving N - P0 - P1; then P2, leaving N - P0 - P1 - P2; and so on, giving partitions P0 to P4.]
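The step-by-step split in this figure (take P0 out of N, then P1 out of the remainder, and so on) can be illustrated with a simple greedy sketch in Python. This is a stand-in technique chosen for illustration, not the partitioning algorithm used in Champion: it assumes cells are listed in dataflow order, that each cell has a single area cost, and that every FPGA has the same capacity. All names and numbers are hypothetical.

```python
def peel_partitions(cells, capacity):
    """Split an ordered list of (name, area) cells into capacity-bounded groups.

    Each group is filled greedily in order, which mimics peeling P0 off the
    netlist, then P1 off what remains, and so on.
    """
    partitions = []
    current, used = [], 0
    for name, area in cells:
        if area > capacity:
            raise ValueError(f"{name} does not fit on a single FPGA")
        if used + area > capacity:       # current FPGA is full: start the next
            partitions.append(current)
            current, used = [], 0
        current.append(name)
        used += area
    if current:
        partitions.append(current)
    return partitions


# Hypothetical cell areas in arbitrary units.
example = [("lowpass", 3), ("sobel", 4), ("stats", 5), ("mask", 2)]
groups = peel_partitions(example, capacity=8)
```

Keeping the cells in dataflow order suits a board used as a linear array, since consecutive partitions then land on neighbouring processing elements; a real partitioner would also have to account for pin counts and RAM access.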


Timing results for ATR on our Wildforce

  • Our Wildforce ACS is 156x faster than Khoros/CPU now.

  • If we had sufficient logic and memory such that no reconfigurations were needed, the ACS could be 667x faster.

  • If fully pipelined, the ACS could be 32,000x faster.

[Figure: bar chart of execution time by component, on an axis from 0 to 6000 (units not shown): Data Processing 33, Data Transfer 34, Host Code 1544, Reconfiguration 5159.]
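A quick way to see where the time goes is to turn the four chart values into shares of the total. The Python sketch below does only that arithmetic; the component values are the ones shown on the slide, the units are not stated in the transcript, and the helper name is hypothetical.

```python
# Values read from the slide's bar chart; the units are not stated.
components = {
    "Data Processing": 33,
    "Data Transfer": 34,
    "Host Code": 1544,
    "Reconfiguration": 5159,
}


def time_shares(parts):
    """Return the total and each component's fraction of it."""
    total = sum(parts.values())
    return total, {name: value / total for name, value in parts.items()}


total, shares = time_shares(components)
for name, share in shares.items():
    print(f"{name:16s} {share:6.1%}")
```

With these numbers reconfiguration accounts for roughly three quarters of the total, while data processing and data transfer together stay at about one percent.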


Partitioning: 1st board configuration phase

[Figure: assignment of operations to CPE0 and PE1 to PE4, with numeric annotations on the edges and PEs. Operations shown: Input Image, Low-Pass Filter (x3), Sobel Filter, Compute Intensity Stats, Check Intensity Stats, Compute Edge Stats, Check Edge Stats, Check >= 4 (x2), AND, Mask Invalid Target Region, Blank Frame Map, Find First Target Pixel, Mask Target Pixels, Mark Frame Pixels, RAM (x3), Write to RAM - A.]


Partitioning: 2nd board configuration phase

[Figure: assignment of operations to CPE0 and PE1 to PE4, with numeric annotations on the edges and PEs. Operations shown: Read from RAM - A, three instances of Find First Target Pixel / Mask Target Pixels / Mark Frame Pixels, RAM (x3), Write to RAM - B.]


Partitioning: 3rd board configuration phase

[Figure: assignment of operations to CPE0 and PE1 to PE4, with numeric annotations on the edges and PEs. Operations shown: Read from RAM - B, two instances of Find First Target Pixel / Mask Target Pixels / Mark Frame Pixels, RAM (x2), Write to RAM - C.]


Partitioning: 4th board configuration phase

[Figure: assignment of operations to CPE0 and PE1 to PE4, with numeric annotations on the edges and PEs. Operations shown: Read from RAM - C, Input Image, Find Max Intensity, Combine Image and Frames, RAM (x2), Output Image.]


Productivity improvement is 100x (250 hours manually vs. 2.5 hours automatically)

[Figure: blocks shown: Application, Khoros, Partitioning Suite, Data Matching, Data Synchronization, WSP2NETLIST, NETLIST2STV, Synthesis/Place & Route, ACS. A time axis compares the Automatic and Manual flows.]


Lessons learned

  • Learned that the translation from KHOROS to hardware is complicated by several factors including:

    • Differences in the way blocks of data are passed from operator to operator.

    • Parameters for data bit-widths must be specified for each cell.

    • Difference between data-driven KHOROS cells and clock-driven hardware cells creates a need for data synchronization.

  • Determined that reconfiguration time was the major obstacle to achieving high performance, and that RAM access conflicts required more reconfigurations than would be otherwise necessary.

  • Learned that manual implementation of KHOROS applications on WildForce is very time-consuming and tedious (250 hours).

  • Thus, great potential exists for making a significant (100x) improvement in productivity via automation.


Schedule and milestones

May 98: Demonstrated the manual mapping of a simple KHOROS network on a Xilinx-based ACS (EVC-1). We also validated our method for library development at the KHOROS, VHDL and FPGA levels.

Sep 98: Demonstrated the manual mapping of a more complex KHOROS network on a Xilinx-based ACS (Annapolis Microsystems Wildforce).

Mar 99: Demonstrated the manual mapping of a complex KHOROS network with some automated FPGA partitioning on the Wildforce.

Sep 99: Automated additional portions of the application development flow.

Jan 00: Will demonstrate the Army Night Vision Lab challenge problem with automatic mapping onto the Wildforce.

Mar 00: Will demonstrate two additional challenge problems (e.g. Face Detection and Image Backprojection) on the Wildforce.

Sep 00: Will demonstrate all three challenge problems on two additional ACS platforms (e.g. Altera-based ACS and latest Xilinx-Virtex ACS).


Champion: a software design environment for adaptive computing systems

  • KEY IDEAS

    • Khoros-based designs will be automatically linked to Adaptive Computing Systems.

    • Mapping onto ACS will be accelerated using precompiled library primitives (semicustom approach).

    • Metrics and visualization will reveal low-level details to the application designer as warranted.

    • The application designer will guide the partitioning of tasks between hardware and software.
