Embedded processor architecture
Download
1 / 36

Embedded Processor Architecture - PowerPoint PPT Presentation


  • 115 Views
  • Uploaded on

Embedded Processor Architecture. 5kk73. flexibility. efficiency. DSP. Programmable CPU. Programmable DSP. Application specific instruction set processor (ASIP). Application specific processor. x4. x3. x2. x1. x0. Z -1. Z -1. Z -1. Z -1. c 4. c 3. c 2.

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about ' Embedded Processor Architecture' - alexis


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
Embedded processor architecture

Embedded Processor Architecture

5kk73


flexibility

efficiency

DSP

Programmable

CPU

Programmable

DSP

Application specific

instruction set

processor (ASIP)

Application

specific processor

Embedded Processor Architecture Henk Corporaal / Bart Mesman


x4

x3

x2

x1

x0

Z-1

Z-1

Z-1

Z-1

c4

c3

c2

c1

c0

*

*

*

*

*

+

y

Application examples (1)

Embedded Processor Architecture Henk Corporaal / Bart Mesman


Application examples (1)

19 instructions per tap!!

Embedded Processor Architecture Henk Corporaal / Bart Mesman


Application examples (2)

Bit level operations:

finite field arithmetic

10 instructions!!

Very simple in hardware

Embedded Processor Architecture Henk Corporaal / Bart Mesman


source register ($2)

27

26

25

23

22

20

srl $13, $2, 20

andi $25, $13, 1

srl $14, $2, 21

andi $24, $14, 6

or $15, $25, $24

srl $13, $2, 22

andi $14, $13, 56

or $25, $15, $14

sll $24, $25, 2

7

6

5

4

3

2

destination register ($24)

Application examples (2)

Bit level operations : DES example

Embedded Processor Architecture Henk Corporaal / Bart Mesman


18

17

16

13

$5

srl

$24, $5, 18

srl

$25, $5, 17

xor

$8, $24, $25

srl

$9, $5, 16

xor

xor

$10, $8, $9

srl

$11, $5, 13

xor

$12, $10, $11

andi

$13, $12, 1

… 0 ...

1

$13

Application examples (2)

Bit level operations : A5 example (GSM encryption)

Embedded Processor Architecture Henk Corporaal / Bart Mesman


Application examples conclusions
Application examples: conclusions

  • CPUs offer flexibility, but…

  • not efficient in performance

  • not efficient in code size

  • not efficient in power consumption

Embedded Processor Architecture Henk Corporaal / Bart Mesman


Power consumption in microprocessors
Power Consumption in microprocessors

Power consumption is (becoming) the limiting factor in processor design

Solution in direction of

  • Hardware acceleration

  • Instruction Level Parallelism instead of clock speed

  • Code size efficiency

source: ISSCC2001, Patrick Gelsinger, Intel

Embedded Processor Architecture Henk Corporaal / Bart Mesman


Amdahl s law
Amdahl’s law

  • Impact of an improvement on the execution time of a program depends on 2 parameters:

    • f = fraction of the original computation time that is affected by the improvement

    • s = speedup factor (local)

  • exec_time_new = exec_time_old * (1-f) + exec_time_old * f / s

  • speedup_overall = exec_time_old / exec_time_new = 1 / ( 1 – f + f / s)

  • if s >> 1 then speedup_overall = 1 / ( 1 – f )

  • Example: 40 % of program can be executed 10 x faster speedup_overall = 1 / ( 0.6 + 0.4 / 10 ) = 1.56

Embedded Processor Architecture Henk Corporaal / Bart Mesman


Conclusions

  • Programmable CPU cores are important for the control parts of the application.

  • They are well supported with tools to support the development of end-user software. ( vs. deeply embedded sw)

  • Keep it Simple heuristic (RISC vs. CISC)

    • Make frequent cases fast and rare cases correct.

    • Regular (orthogonal) instruction set

    • No special features that match a high level language construct.

    • At least 16 registers to ease register allocation.

  • Embedded cores are often light cores which are a compromise between performance, area and power dissipation. (vs. stand-alone CPU cores which are optimised for performance)

Embedded Processor Architecture Henk Corporaal / Bart Mesman


Programmable digital signal processors

CPI = 1

Programmable Digital Signal Processors

  • instruction level parallelism (ILP)

  • hardware support for loop control

  • attention for high level data types e.g. arrays, delaylines

    • (vs. scalars for CPUs)

  • difficult to compare architectures

    • e.g. DIT, DIF, radix 2/4, FFT loop unrolling, scaling,

    • shuffling, intialisation … can be included or forgotten

    • benchmarking (Berkeley Design Technology Inc (BDTi))

    • (compare to SpecInt benchmarks for CPs)

Embedded Processor Architecture Henk Corporaal / Bart Mesman


Outline power

  • architectures for programmable DSPs

    • multiplier-accumulator

    • modified Harvard architecture

    • extension with an ALU (decision making)

    • controller architectures

  • examples: TI, Motorola, Philips

  • code generation

  • developments: VLIW (Very Long Instruction Word)

    • examples: C6 and TM

Embedded Processor Architecture Henk Corporaal / Bart Mesman


DSP data types power

  • not every signal requires 32 bits

  • 2 types of DSP: floating point and integer

  • advantages FP: most specs are in FP

    • (conversion to int is time consuming since the behavior

    • may change)

  • disadvantage FP: cost (area, speed, power)

  • integer multiplication doubles the number of bits: n * n => 2n

Embedded Processor Architecture Henk Corporaal / Bart Mesman


c(i) power

x(i)

control

MPY

(Booth,

Wallace..)

clock

P_reg

PR

ADDER

clock

ACR

P_reg

SHIFT

ROUND

TRUNCATE

Embedded Processor Architecture Henk Corporaal / Bart Mesman


Prog/data power

memory

prog

mem.

data

mem.

prog

mem.

data

mem. 1

data

mem. 2

EXU

EXU

EXU

Harvard

Modified Harvard

Von Neumann

(sequential)

 c(i) * x(i)

Goal = 1 cycle per iteration

Embedded Processor Architecture Henk Corporaal / Bart Mesman


Interrupt power

address

Reset

ACU_A

ACU_B

AR_A

AR_B

Stack

+1

PC

Program

Memory

DR_A

DR_B

IR

Control Bus

Rfile

RAM_A

RAM_B

MAC

Embedded Processor Architecture Henk Corporaal / Bart Mesman


time loop power

1 cycle/tap ?

 ci * xi

filter loop i

How updating the delayline ?

x5

x4

x3

x2

x1

Z-1

Z-1

Z-1

Z-1

c5

c4

c3

c2

c1

*

*

*

*

*

+

y

Embedded Processor Architecture Henk Corporaal / Bart Mesman


Solution 2: indirect adressing power

  • use of a pointer to mark the begin of the delay line

  • problem: trashing of the whole memory

  • solution: modulo addressing

  • need for a register to store the pointer

Embedded Processor Architecture Henk Corporaal / Bart Mesman


ACU architecture and power

Instruction set

A

S

Output reg A reg S

Read_A A A S

Read_S S A S

incA A+1 A+1 S

decA A-1 A-1 S

Step A+S A+S S

Inc_step S+1 A S+1

Modulo

16 10 000

23 10 111

mask

=hold

Modulo can be

implemented as a

mask operation

if the size is 2k

output

to RAM

Embedded Processor Architecture Henk Corporaal / Bart Mesman


Addressing modes power

  • register ADD R4, R3 R[R4] = R[R4] + R[R3]

  • immediate ADD R4, #3 R[R4] = R[R4] + #3

  • direct ADD R4, (100) R[R4] = R[R4] + Mem[100]

  • indirect ADD R4, (R3) R[R4] = R[R4] + Mem[R[R3]]

    • w. inc/dec ADD R4, (R3)± R[R4] = R[R4] + Mem[R[R3]]

      • R[R3] = R[R3] ± 1

  • indexed ADD R4, (R3±R2) R[R4] = R[R4] + Mem[R[R3]]

    • R[R3] = R[R3] ± R[R2]

    • Remarks

    • direct = for static data

    • indirect = for arrays

      • inc/dec = for stepping through arrays e.g.  xn

      • index = for stepping through arrays e.g.  x2n

    Embedded Processor Architecture Henk Corporaal / Bart Mesman


    Addressing modes: extra for DSP power

    • 8 ARs (address or auxiliary register) available

    • extra indirect modes

      • circular *ARn ± % post inc/dec by 1 - circular

        • *ARn ± AR0 % post inc/dec by AR0 - circular

    • bit reverse *ARn ± AR0 B post inc/dec by AR0 - bit rev.

    Embedded Processor Architecture Henk Corporaal / Bart Mesman


    Interrupt power

    address

    Reset

    ACU_A

    ACU_B

    AR_A

    AR_B

    Stack

    +1

    PC

    RAM_A

    RAM_B

    Program

    Memory

    DR_A

    DR_B

    IR

    MAC

    ALU

    Control Bus

    Rfile

    Embedded Processor Architecture Henk Corporaal / Bart Mesman


    first solution power

     c(i) * x(i)

    Not shown

    coefficient RAM+ACU

    resources

    6 clockcycles/sample

    limit pipelines in the controller

    time (cc)

    Embedded Processor Architecture Henk Corporaal / Bart Mesman


    a poweri

    a0

    f

    f

    f

    f

    f

    for i = 0 to n

    bi = f(ai)

    ci = g(bi)

    di = h(ci)

    a1

    g

    g

    g

    g

    g

    bi

    b0

    ci-2

    bi-1

    ai

    for i = 2 to n

    bi = f(ai)

    ci-1 = g(bi-1)

    di-2 = h(ci-2)

    a2

    h

    h

    h

    h

    h

    ci

    c0

    b1

    c1

    di

    d0

    b2

    di-2

    ci-1

    bi

    c2

    d1

    d2

    Loopfolding (software pipelining)

    Embedded Processor Architecture Henk Corporaal / Bart Mesman


    Loopfolding (software pipelining) power

     c(i) * x(i)

    Pre- and postamble

    4 clockcycles /sample

    Embedded Processor Architecture Henk Corporaal / Bart Mesman


    hardware support for loop control power

     c(i) * x(i)

    1 clockcycles/sample

    repeat instruction and repeat block

    Embedded Processor Architecture Henk Corporaal / Bart Mesman


    TMS320C5000 power

    T register

    E

    D

    P

    C

    D

    T

    B

    A

    T

    C

    B

    A

    C

    D

    A

    D

    Sign ctr

    Sign ctr

    A(40)

    B(40)

    Sign ctr

    Sign ctr

    Sign ctr

    Multiplier (17*17)

    MUX

    A

    ALU (40)

    M

    U

    A

    B

    B

    0

    A

    B

    Barrer shifter

    fractional

    MUX

    MUX

    COMP

    Adder (40)

    MSW/LSW

    select

    TRN

    ZERO

    SAT

    ROUND

    TC

    Embedded Processor Architecture Henk Corporaal / Bart Mesman


    Address bus power

    16 bits

    Motorola 56K family

    EXTERNAL

    ADRESS SWITCH

    P Address

    Y Address

    X Address

    2,048-by-24-bit

    PROGRAM

    MEMORY

    ROM

    X memory

    256-by-24-bit

    RAM

    256-by-24-bit

    ROM

    Address

    ALU

    Y memory

    256-by-24-bit

    RAM

    256-by-24-bit

    ROM

    INTERNAL

    DATA-BUS

    SWITCH

    X-DATA

    EXTERNAL

    DATA-BUS

    SWITCH

    24 BITS

    Y DATA

    DATA

    BUS

    P DATA

    GLOBAL DATA

    ON CHIP

    PERIPHERALS,

    HOST,

    SYNCHRONOUS

    SERIAL INTERFACE

    SERIAL COMMU-

    NICATIONS

    INTERFACE,

    PROGRAMMED I/O,

    BUS CONTROL

    DATA ALU

    24-by-24 bit

    MULTIPLIER-

    ACCUMULATOR

    PRODUCING

    56 BIT RESULT

    24 BITS

    PROGRAM CONTROLLER

    I/O

    PORTS

    2 BITS

    3 BITS

    7 BITS

    CLOCK

    INTERRUPT

    Embedded Processor Architecture Henk Corporaal / Bart Mesman


    X power

    Y

    Two address

    Compution

    units

    X data

    memory

    Y data

    memory

    16 bit

    bus

    16 bit

    bus

    Two 16-by-16 bit

    multipliers

    Program

    control

    unit

    Y0

    Y0

    Y1

    Y1

    X

    X

    PO

    P1

    scale

    scale

    96-bit instructions

    Program

    memory

    (Z data)

    Instruction decoder

    Two 40 bit

    arithmic-

    logic units

    shift

    Saturation

    Saturation

    Four 40 bit

    accumulators

    16-bit

    bus

    Saturation/scale

    R.E.A.L.

    X data

    Buses for

    Y data

    Z data

    Embedded Processor Architecture Henk Corporaal / Bart Mesman


    source power

    lexical analysis

    syntax analysis

    Front end

    semantic analysis

    Intermediate

    machine independent

    representation

    Code selection

    Register allocation

    Code generation

    scheduling

    1 instr = // ops

    order of instr

    code

    Embedded Processor Architecture Henk Corporaal / Bart Mesman


    Intermediate power

    machine independent

    representation

    BBi

    BBk

    BBj

    a

    b

    c

    d

    *

    c

    +

    t1 := a * b

    t2 := c + d

    t3 := t1 + c

    out := t2 * t3

    t1

    t2

    +

    t3

    *

    Embedded Processor Architecture Henk Corporaal / Bart Mesman


    Code selection example power

    d memory

    p memory

    ADSP

    [Analog Devices]

    ax

    ay

    af

    mx

    my

    mf

    x

    y

    x

    y

    + -

    *

    MAC

    ALU

    + -

    ar

    mr

    Embedded Processor Architecture Henk Corporaal / Bart Mesman


    Example of code selection power

    = covering of intermediate representation with RTPs

    mx := dmem

    my := pmem

    ax := dmem

    ay := pmem

    a

    b

    c

    d

    mr := dmem

    *

    c

    +

    ar := ax + ay

    3:

    t1

    t2

    +

    2:

    Mr := mr + (mx * my)

    t3

    *

    1:

    my := ar

    mr = mr * my

    Embedded Processor Architecture Henk Corporaal / Bart Mesman


    • Problems power

    • local decisions which have a global impact

    • phase coupling: example

      • asap schedule

      • maximal freedom for scheduling

      • code selection during scheduling

      • register allocation comes afterwards

      • can lead to infeasible solutions

    Embedded Processor Architecture Henk Corporaal / Bart Mesman


    phase coupling: discussion power

    It is very difficult and almost impossible to develop robust and

    efficient DSP compilers.

    Current DSP practice = programming in assembler

    Solution:

    1. Solve code generation for DSPs

    2. Step back and rethink the architecture

    develop an architecture which is still efficient but also

    a good model for building a compiler

    Efficiency = exploit instruction level parallelism (ILP)

    compilation = systematic positioning of registers and regular

    interconnect

    = VLIW = Very Long Instruction Word

    Embedded Processor Architecture Henk Corporaal / Bart Mesman


    ad