cpre coms 583 reconfigurable computing n.
Download
Skip this Video
Loading SlideShow in 5 Seconds..
CprE / ComS 583 Reconfigurable Computing PowerPoint Presentation
Download Presentation
CprE / ComS 583 Reconfigurable Computing

Loading in 2 Seconds...

play fullscreen
1 / 46

CprE / ComS 583 Reconfigurable Computing - PowerPoint PPT Presentation


  • 124 Views
  • Uploaded on

CprE / ComS 583 Reconfigurable Computing. Prof. Joseph Zambreno Department of Electrical and Computer Engineering Iowa State University Lecture #21 – HW/SW Codesign. Quick Points. Midterm graded and returned Average – 84.5 Median – 85.0 Maximum – 95.0 Minimum – 72.0

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about 'CprE / ComS 583 Reconfigurable Computing' - hakeem-ingram


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
cpre coms 583 reconfigurable computing

CprE / ComS 583Reconfigurable Computing

Prof. Joseph Zambreno

Department of Electrical and Computer Engineering

Iowa State University

Lecture #21 – HW/SW Codesign

quick points
Quick Points
  • Midterm graded and returned
    • Average – 84.5
    • Median – 85.0
    • Maximum – 95.0
    • Minimum – 72.0
    • Standard Deviation – 6.65

CprE 583 – Reconfigurable Computing

hw 4 discussion
HW #4 Discussion
  • Problem 1 – did just a simple adder work?
  • Problem 2 – how did you implement the permutation table?
  • Problem 3 – did you use a counter?

CprE 583 – Reconfigurable Computing

aes 128e algorithm
AES-128E Algorithm

Round Transformation

round++

KeyExpansion

128-bit

key

ShiftRows

MixColumns

AddRoundKey

SubBytes

128-bit

plaintext

No

round = 10?

Yes

128-bit

ciphertext

CprE 583 – Reconfigurable Computing

overview of aes cont

51

ca

0

b7

d0

09

ba

e1

cd

60

e7

53

04

63

70

8c

e0

7c

c8

25

98

1

83

32

89

b5

c7

ef

a3

81

d1

0c

82

fd

23

37

77

aa

4f

3a

11

13

40

c9

00

b5

2c

2

93

0d

25

fb

1a

6d

ec

0d

dc

7d

7b

2e

8f

c3

3

0a

66

26

11

ed

a6

d5

06

82

43

f2

2a

fa

40

4

20

18

1b

56

5f

36

92

09

00

fa

6b

5

42

3f

43

1b

af

2c

12

4a

c7

4a

cc

6f

1b

dd

20

6

41

b5

46

ed

6f

21

34

ba

25

12

25

3a

35

4a

25

a6

7

81

c5

99

45

a6

60

8c

b5

8c

dc

3a

bb

af

97

e1

6d

46

fa

20

23

c5

4f

30

55

43

4f

b5

8

b5

fc

0d

9

43

43

c3

e1

00

09

aa

f2

95

63

b5

01

a6

12

ee

44

ee

88

ff

c7

a

cb

36

aa

25

dc

1b

17

8c

c3

67

b5

33

5a

0c

4a

bc

1e

6d

ef

57

4f

ed

2b

2a

0d

b

b5

60

a6

0d

60

51

a6

8e

00

4a

67

33

fe

69

6f

19

04

c

bc

48

b5

d

13

00

3b

84

d7

25

e1

1f

4f

fd

25

2a

fa

3a

ba

15

1b

89

0f

43

77

ed

00

ab

60

e

09

b5

60

6f

8c

09

ba

dc

aa

f

dc

4a

73

49

20

76

51

4f

a6

4a

22

5d

e

c

8

a

2

0

6

4

9

f

1

3

5

b

7

d

x

y

Overview of AES (cont.)
  • 128-bit input is copied into a two-dimensional (4x4) byte array referred to as the state
    • Round transformations operate on the state array
    • Final state copied back into 128-bit output
  • AES makes use of a non-linear substitution function that operates on a single byte
    • Can be simplified as a look-up table (S-box)

S-box

CprE 583 – Reconfigurable Computing

aes 128e modules subbytes

Round Transformation

round++

KeyExpansion

128-bit

key

ShiftRows

MixColumns

S'0,0

S'0,1

S'0,2

S'0,3

S0,0

S0,1

S0,2

S0,3

S'1,0

S'1,2

S'1,3

S1,0

S1,2

S1,3

S'r,c

Sr,c

AddRoundKey

SubBytes

128-bit

plaintext

S'2,0

S'2,1

S'2,2

S'2,3

S2,0

S2,1

S2,2

S2,3

S'3,0

S'3,1

S'3,2

S'3,3

S3,0

S3,1

S3,2

S3,3

No

round = 10?

Yes

128-bit

ciphertext

AES-128E Modules: SubBytes

SubBytes

  • S-box transformation performed independently on each byte of the state

S-box

state[i]

state'[i]

CprE 583 – Reconfigurable Computing

aes 128e modules shiftrows

Round Transformation

round++

KeyExpansion

128-bit

key

ShiftRows

MixColumns

AddRoundKey

SubBytes

128-bit

plaintext

No

round = 10?

Yes

128-bit

ciphertext

AES-128E Modules: ShiftRows

ShiftRows

  • Bytes in the last three rows of the state are shifted cyclically over variable offsets

S0,0

S0,1

S0,2

S0,3

S'0,0

S'0,1

S'0,2

S'0,3

S1,0

S1,1

S1,2

S1,3

S'1,1

S'1,2

S'1,3

S'1,0

state[i]

state'[i]

S2,0

S2,1

S2,2

S2,3

S'2,2

S'2,3

S'2,0

S'2,1

S3,0

S3,1

S3,2

S3,3

S'3,3

S'3,0

S'3,1

S'3,2

CprE 583 – Reconfigurable Computing

aes 128e modules mixcolumns

Round Transformation

round++

KeyExpansion

128-bit

key

ShiftRows

MixColumns

AddRoundKey

SubBytes

128-bit

plaintext

No

round = 10?

S'0,0

S0,0

S0,2

S'0,2

S0,3

S'0,3

S'0,1

S0,1

S'1,0

S1,0

S'1,2

S1,2

S1,3

S'1,3

S1,1

S'1,1

Yes

S2,0

S'2,0

S2,2

S'2,2

S'2,3

S2,3

S2,1

S'2,1

128-bit

ciphertext

S'3,0

S3,0

S'3,2

S3,2

S'3,3

S3,3

S'3,1

S3,1

AES-128E Modules: MixColumns

MixColumns

  • Modulo polynomial-basis multiplication performed on each column of the state
  • Can be simplified as series of AND and XOR operations

{03h}

state[i]

state'[i]

{02h}

CprE 583 – Reconfigurable Computing

mixcolumns implementation
MixColumns Implementation

-- Multiply by 2

t1 := STATE_IN(i mod 4)(j)(6 downto 0) & '0';

if (STATE_IN(i mod 4)(j)(7) = '1') then

t1 := t1 xor x"1b";

end if;

-- Multiply by 3

t2 := STATE_IN((i+1) mod 4)(j)(6 downto 0) & '0';

if (STATE_IN((i+1) mod 4)(j)(7) = '1') then

t2 := t2 xor x"1b";

end if;

t2 := t2 xor STATE_IN((i+1) mod 4)(j);

tSTATE(i)(j) <= t1 xor t2 xor STATE_IN((i+2) mod 4)(j) xor STATE_IN((i+3) mod 4)(j);

end loop;

end loop;

end process;

entity MixColumns is

port (STATE_IN : in STATEtype;

RNUM_IN : in RNUMtype;

STATE_OUT : out STATEtype);

end MixColumns;

architecture behavior of MixColumns is

signal tSTATE : STATEtype;

begin

process(STATE_IN)

variable t1, t2 : std_logic_vector(7 downto 0);

begin

for i in 0 to 3 loop

for j in 0 to Nb-1 loop

CprE 583 – Reconfigurable Computing

aes 128e modules addroundkey

Round Transformation

round++

KeyExpansion

128-bit

key

ShiftRows

MixColumns

AddRoundKey

SubBytes

128-bit

plaintext

No

round = 10?

Yes

128-bit

ciphertext

AES-128E Modules: AddRoundKey

AddRoundKey

  • Words from the round-specific key are XORed into columns of the state

S0,0

S0,2

S0,3

S'0,0

S'0,2

S'0,3

S0,1

S'0,1

S1,0

S1,2

S1,3

S'1,0

S'1,2

S'1,3

S1,1

S'1,1

state[i]

state'[i]

S2,0

S2,2

S2,3

S'2,0

S'2,2

S'2,3

S2,1

S'2,1

S3,0

S3,2

S3,3

S'3,0

S'3,2

S'3,3

S3,1

S'3,1

w[1]

w[0]

w[2]

w[3]

Rkey[i]

CprE 583 – Reconfigurable Computing

addroundkey implementation
AddRoundKey Implementation

entity AddRoundKey is

port(STATE_IN : in STATEtype;

KEY_IN : in KEYtype;

STATE_OUT : out STATEtype);

end AddRoundKey;

architecture behavior of AddRoundKey is

begin

process(STATE_IN, KEY_IN)

begin

for j in 0 to (Nb-1) loop

STATE_OUT(0)(j) <= STATE_IN(0)(j) xor KEY_IN(j)(31 downto 24);

STATE_OUT(1)(j) <= STATE_IN(1)(j) xor KEY_IN(j)(23 downto 16);

STATE_OUT(2)(j) <= STATE_IN(2)(j) xor KEY_IN(j)(15 downto 8);

STATE_OUT(3)(j) <= STATE_IN(3)(j) xor KEY_IN(j)(7 downto 0);

end loop;

end process;

end behavior;

CprE 583 – Reconfigurable Computing

aes 128e modules keyexpansion

Round Transformation

round++

KeyExpansion

128-bit

key

ShiftRows

MixColumns

AddRoundKey

SubBytes

128-bit

plaintext

No

round = 10?

Yes

128-bit

ciphertext

AES-128E Modules: KeyExpansion

KeyExpansion

  • Initial 128-bit key is converted into separate keys for each of the 10 required rounds
  • Consists of Sbox transformations and some XORs

Rkey[1]

Rkey[2]

w[0]

S

w[4]

Rkey[3]

Rkey[4]

w[1]

S

w[5]

128-bit

key

Rkey[5]

Rkey[6]

w[2]

S

w[6]

Rkey[7]

w[3]

S

w[7]

Rkey[8]

Rkey[9]

Rkey[10]

rcon

CprE 583 – Reconfigurable Computing

design decisions
Design Decisions
  • Online/offline key generation
  • Inter-round layout decisions
    • Round unrolling
    • Round pipelining
  • Intra-round layout decisions
    • Transformation pipelining
    • Transformation partitioning
  • Technology mapping decisions
    • S-box synthesis as Block SelectRAM, distributed ROM primitives, or logic gates

CprE 583 – Reconfigurable Computing

round unrolling pipelining
Round Unrolling / Pipelining
  • Unrolling replaces a loop body (round) with N copies of that loop body
  • AES-128E algorithm is a loop that iterates 10 times – N є [1, 10]
    • N = 1 corresponds to original looping case
    • N = 10 is a fully unrolled implementation
  • Pipelining is a technique that increases the number of blocks of data that can be processed concurrently
    • Pipelining in hardware can be implemented by inserting registers
    • Unrolled rounds can be split into a certain number of pipeline stages
  • These transformations will increase throughput but increase area and latency

CprE 583 – Reconfigurable Computing

round unrolling pipelining cont
Round Unrolling / Pipelining (cont.)

Unrolling factor = 10

Unrolling factor = 2

Unrolling factor = 1

Unrolling factor = 5

Round pipelining = ON

R1

R2

R3

R4

R5

Input

plaintext

Output

Ciphertext

R10

R9

R8

R7

R6

CprE 583 – Reconfigurable Computing

transformation partitioning pipelining
Transformation Partitioning/Pipelining
  • FPGA maximum clock frequency depends on critical logic path
    • Inter-round transformations can’t improve critical path
    • Individual transformations can be pipelined with registers similar to the rounds
  • Transformations that are part of the maximum delay path can be partitioned and pipelined as well
  • Can result in large gains in throughput with only minimal area increases

CprE 583 – Reconfigurable Computing

partitioning pipelining cont
Partitioning / Pipelining (cont.)

Transformation pipelining = ON

Transformation partitioning = ON

SubBytes

ShiftRows

MixColumns

AddRoundKey

KeyExpansionA

KeyExpansionB

KeyExpansion

KeyExpansionC

CprE 583 – Reconfigurable Computing

s box technology mapping
S-box Technology Mapping
  • With synthesis primitives, can map the S-box lookup tables to different hardware components
  • Two S-boxes can fit on a single Block SelectRAM

constantSSYNROMSTYLE: string:= “select_rom”; -- {logic, select_rom}

entity Sboxis

port(BYTE_IN: in std_logic_vector(7 downto 0);

BYTE_OUT: out std_logic_vector(7 downto 0));

attribute syn_romstyle:string;

attribute syn_romstyle of BYTE_OUT: signal is SSYNROMSTYLE;

end Sbox;

...

Sample VHDL code

CprE 583 – Reconfigurable Computing

recap retiming
Recap – Retiming

CprE 583 – Reconfigurable Computing

recap retiming cont
Recap – Retiming (cont.)

weight(e) = weight(e) + lag(head(e)) - lag(tail(e))

CprE 583 – Reconfigurable Computing

retiming and pipelining
Retiming and Pipelining
  • Can use this retiming to pipeline
  • Assume have enough (infinite supply) of registers at edge of circuit
  • Retime them into circuit
  • See [WeaMar03A] for details

CprE 583 – Reconfigurable Computing

recap retiming and covering
Recap – Retiming and Covering

CprE 583 – Reconfigurable Computing

outline
Outline
  • HW #4 Discussion
  • Recap
  • HW/SW Codesign
    • Motivation
    • Specification
    • Partitioning
    • Automation

CprE 583 – Reconfigurable Computing

hardware software codesign
Hardware/Software Codesign
  • Definition 1 – the concurrent and co-operative design of hardware and software components of an embedded system
  • Definition 2 – A design methodology supporting the cooperative and concurrent development of hardware and software (co-specification, co-development, and co-verification) in order to achieve shared functionality and performance goals for a combined system [MicGup97A]

CprE 583 – Reconfigurable Computing

motivation
Motivation
  • Not possible to put everything in hardware due to limited resources
  • Some code more appropriate for sequential implementation
  • Desirable to allow for parallelization, serialization
  • Possible to modify existing compilers to perform the task

CprE 583 – Reconfigurable Computing

why put cpus on fpgas
Why put CPUs on FPGAs?
  • Shrink a board to a chip
  • What CPUs do best:
    • Irregular code
    • Code that takes advantage of a highly optimized datapath
  • What FPGAs do best:
    • Data-oriented computations
    • Computations with local control

CprE 583 – Reconfigurable Computing

computational model

Memory

FPGA

Computational Model
  • Most recent work addressing this problem assumes relatively slow bus interface
  • FPGA has direct interface to memory in this model

Memory

bus

General-

Purpose

Processor

CprE 583 – Reconfigurable Computing

hardware software partitioning
Hardware/Software Partitioning

if (foo < 8) {

for (i=0; i<N; i++)

x[i] = y[i]*z[i];

}

CPU

HW

Accelerator

CprE 583 – Reconfigurable Computing

methodology
Methodology
  • Separation between function, and communication
  • Unified refinable formal specification model
    • Facilitates system specification
    • Implementation independent
    • Eases HW/SW trade-off evaluation and partitioning
  • From a more practical perspective:
    • Measure the application
    • Identify what to put onto the accelerator
    • Build interfaces

CprE 583 – Reconfigurable Computing

system level methodology
System-Level Methodology

Informal Specification,

Constraints

Component

profiling

System model

Performance

evaluation

Architecture design

HW/SW implementation

Fail

Test

Prototype

Success

Implementation

CprE 583 – Reconfigurable Computing

concurrency
Concurrency
  • Concurrent applications provide the most speedup

No data dependencies

CPU

if (a > b) ...

x[i] = y[i] * z[i]

accelerator

CprE 583 – Reconfigurable Computing

partitioning

Process 1

Process 2

Process 3

Partitioning
  • Can divide the application into several processes that run concurrently
  • Process partitioning exposes opportunities for parallelism

if (i>b) …

for (i=0; i<N; i++) …

for (j=0; j<N; j++) ...

CprE 583 – Reconfigurable Computing

automating system partitioning

process (a, b, c)

in port a, b;

out port c;

{

read(a);

write(c);

}

Specification

Automating System Partitioning

Line ()

{

a = …

detach

}

Interface

  • Good partitioning mechanism:
    • Minimize communication across bus
    • Allows parallelism  both hardware (FPGA) and processor operating concurrently
    • Near peak processor utilization at all times (performing useful work)

Partition

Model

FPGA

Capture

Synthesize

Processor

CprE 583 – Reconfigurable Computing

partitioning algorithms
Partitioning Algorithms

Software

Hardware

  • Assume everything initially in software
  • Select task for swapping
  • Migrate to hardware and evaluate cost
    • Timing, hardware resources, program and data storage, synchronization overhead
  • Cost evaluation and move evaluation similar to what we’ve seen regarding mincut and simulated annealing

task

List of tasks

List of tasks

CprE 583 – Reconfigurable Computing

multi threaded systems
Multi-threaded Systems
  • Single thread:
  • Multi-thread:

CprE 583 – Reconfigurable Computing

performance analysis
Single threaded:

Find longest possible execution path

Multi-threaded with no synchronization:

Find the longest of several execution paths

Multi-threaded with synchronization:

Find the worst-case synchronization conditions

Performance Analysis

CprE 583 – Reconfigurable Computing

multi threaded performance analysis
Multi-threaded Performance Analysis
  • Synchronization causes the delay along one path to affect the delay along another

ta

tb

synchronization point

tc

td

Delay = max(ta, tb) + td

CprE 583 – Reconfigurable Computing

control
Control
  • Need to signal between CPU and accelerator
    • Data ready
    • Complete
  • Implementations:
    • Shared memory
    • Handshake
  • If computation time is very predictable, a simpler communication scheme may be possible

CprE 583 – Reconfigurable Computing

communication levels

Application

Program

Operating

System

I/O driver

I/O bus

Communication Levels

Send, Receive, Wait

  • Easier to program at application level
    • (send, receive, wait) but difficult to predict
  • More difficult to specify at low level
    • Difficult to extract from program but timing and resources easier to predict

Application

hardware

(custom)

Register reads/writes

I/O driver

Interrupt service

Bus transactions

I/O bus

Interrupts

CprE 583 – Reconfigurable Computing

other interface models
Other Interface Models
  • Synchronization through a FIFO
  • FIFO can be implemented either in hardware or in software
  • Effectively reconfigure hardware (FPGA) to allocate buffer space as needed
  • Interrupts used for software version of FIFO

r3

p1

p2

p3

r2

d1

FPGA

Control/Data FIFO

d3

d2

CprE 583 – Reconfigurable Computing

debugging
Debugging
  • Hard to test a CPU/accelerator system:
    • Hard to control and observe the accelerator without the CPU
    • Software on CPU may have bugs
  • Build separate test benches for CPU code, accelerator
  • Test integrated system after components have been tested

CprE 583 – Reconfigurable Computing

polis codesign methodology

Formal

Verification

Compilers

Partitioning

Simulation

POLIS Codesign Methodology

................

ESTEREL

Graphical EFSM

CFSMs

Sw Synthesis

Hw Synthesis

Intfc + RTOS

Synthesis

Sw Code +

RTOS

Logic Netlist

Rapid prototyping

CprE 583 – Reconfigurable Computing

codesign finite state machines
Codesign Finite State Machines
  • POLIS uses an FSM model for
    • Uncommitted
    • Synthesizable
    • Verifiable

Control-dominated HW/SW specification

  • Translators from
    • State diagrams,
    • Esterel, ECL, ReactiveJava
    • HDLs

Into a single FSM-based language

CprE 583 – Reconfigurable Computing

cfsm behavior
CFSM behavior
  • Four-phase cycle:
    • Idle
    • Detect input events
    • Execute one transition
    • Emit output events
  • Software response could take a long time:
    • Unbounded delay assumption
  • Need efficient hw/sw communication primitive:
    • Event-based point-to-point communication

CprE 583 – Reconfigurable Computing

network of cfsms
Network of CFSMs
  • Globally Asynchronous, Locally Synchronous (GALS) model

F

B=>C

C=>F

G

C=>G

C=>G

F^(G==1)

C

C=>A

CFSM2

CFSM1

CFSM1

CFSM2

C

A

C=>B

B

C=>B

(A==0)=>B

CFSM3

CprE 583 – Reconfigurable Computing

summary
Summary
  • Hardware/software codesign complicated and limited by performance estimates
  • Algorithms not generally as good as human partitioning
  • Other interesting issues include dual processors, special memory interfaces
  • Will likely evolve at faster rate as compilers evolve

CprE 583 – Reconfigurable Computing