Precision Timed Embedded Systems Using TickPAD Memory

Presentation Transcript


Precision Timed Embedded Systems Using TickPAD Memory

Matthew M Y Kuo*

Partha S Roop*

Sidharta Andalam†

Nitish Patel*

*University of Auckland, New Zealand

†TUM CREATE, Singapore


Introduction

  • Hard real time systems

    • Need to meet real time deadlines

    • Catastrophic consequences may occur when a deadline is missed

  • Synchronous execution approach

    • Good for hard real time systems

      • Deterministic

      • Reactive

    • Aids static timing analysis

      • Well bounded programs

        • No unbounded loops or recursion


Synchronous Languages

  • Executes in logical time

    • Ticks

      • Sample input → computation → emit output

  • Synchronous hypothesis

    • Ticks are instantaneous

      • Assumes the system executes infinitely fast

      • The system responds faster than its environment

    • Worst case reaction time

      • Time between two logical ticks

  • Languages

    • Esterel

    • Scade

    • PRET-C

      • Extension to C



PRET-C

  • Light-weight multithreading in C

  • Provides thread safe memory access

  • C extension implemented as C macros
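The flavour of such macros can be sketched in plain C. The sketch below is a hypothetical illustration in the style of protothreads, not PRET-C's actual macro definitions: each thread resumes from its last EOT via the switch/__LINE__ local-continuation trick, and PAR here is a simplified fixed-priority two-thread scheduler.

```c
#include <assert.h>

/* Hypothetical sketch of cooperative-thread macros in plain C,
 * in the spirit of PRET-C's PAR/EOT (the real macros differ).
 * Threads resume after their last EOT via the protothread-style
 * switch/__LINE__ trick. */
typedef struct { int state; int done; } thread_t;

#define THREAD_BEGIN(t) switch ((t)->state) { case 0:
#define EOT(t)          do { (t)->state = __LINE__; return; \
                             case __LINE__: ; } while (0)
#define THREAD_END(t)   } (t)->done = 1; return

int trace[8];   /* records which compute step ran, in order */
int n;

void t1(thread_t *t) {
    THREAD_BEGIN(t);
    trace[n++] = 11;        /* compute; */
    EOT(t);                 /* end of t1's first local tick */
    trace[n++] = 12;        /* compute; */
    EOT(t);
    THREAD_END(t);
}

void t2(thread_t *t) {
    THREAD_BEGIN(t);
    trace[n++] = 21;        /* compute; */
    EOT(t);
    THREAD_END(t);
}

/* Simplified PAR: run both threads in priority order, one global
 * tick per loop iteration, until both terminate. */
int run_par(void) {
    thread_t a = {0, 0}, b = {0, 0};
    n = 0;
    while (!a.done || !b.done) {
        if (!a.done) t1(&a);
        if (!b.done) t2(&b);
    }
    return n;               /* number of compute steps executed */
}
```

Calling run_par() interleaves the threads tick by tick, giving the step order 11, 21, 12: both threads run their first local tick, then t1 alone runs in the second global tick.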


Introduction

  • Practical systems require larger memories

    • Not all applications fit in on-chip memory

  • A memory hierarchy is required

    • Processor–memory gap

[1] Hennessy, John L., and David A. Patterson. Computer Architecture: A Quantitative Approach. San Francisco, CA: Morgan Kaufmann, 2011.


Introduction

  • Traditional approaches

    • Caches

    • Scratchpads

  • However,

    • there is scant research on memory architectures tailored to synchronous execution and concurrency




Caches

[Diagram: CPU – Cache – Main Memory]

  • Traditional caches

    • Small fast piece of memory

      • Temporal locality

      • Spatial locality

    • Hardware Controlled

      • Replacement policy



Caches

[Diagram: CPU – Cache – Main Memory]

  • Hard real time systems

    • The architecture must be modelled

      • To compute the WCRT

    • Cache models

      • Trade-off between analysis time and tightness

      • A very tight worst-case estimate is not scalable



Scratchpad

[Diagram: CPU – SPM – Main Memory]

  • Scratchpad Memory (SPM)

    • Software controlled

    • Statically allocated

      • Statically or dynamically loaded

    • Requires an allocation algorithm

      • e.g. ILP, Greedy



Scratchpad

[Diagram: CPU – SPM – Main Memory]

  • Hard real time systems

    • Easy to compute a tight WCRT

    • Reduces the worst case execution time

    • Balance between the number of reload points and their overheads

      • May perform worse than caches in the worst case




TickPAD

[Diagram: CPU – TPM / Cache / SPM – Main Memory]

  • Caches: good overall performance, hardware controlled

  • SPMs: good worst case performance, easy fast and tight static analysis


TickPAD

[Diagram: CPU – TPM – Main Memory]

  • TickPAD Memory (TPM)

    • TickPAD – Tick Precise Allocation Device

    • Memory controller

      • Hybrid between caches and scratchpads

        • Hardware controlled features

        • Static software allocation

    • Tailored for synchronous languages

    • Instruction memory


TickPAD Design Flow

[Figure: design flow]


PRET-C

[Diagram: main forks t1, t2, t3]

int main() {
    init();
    PAR(t1,t2,t3);
    ...
}

void thread t1() {
    compute;
    EOT;
    compute;
    EOT;
}


PRET-C

The same example, annotated across successive slides:

  • Computation – the compute statements

  • Spawn children threads – PAR(t1,t2,t3)

  • End of tick – EOT, the synchronization boundaries

  • Child thread terminates – after its final EOT

  • Main thread resumes – once all children have terminated


PRET-C Execution

[Diagram: execution trace of main, t1, t2, t3 over time]

  • Inputs are sampled at the start of the tick

  • Threads execute in order: main, then t1, t2, t3

  • Outputs are emitted at the end of the tick

  • 1 tick (reaction time): from sampling inputs to emitting outputs

  • A local tick is a single thread's execution within the global tick
Assumptions

  • 1 cache line holds 4 instructions

  • Filling a line takes 1 burst transfer from main memory

  • A cache miss takes 38 clock cycles [2]

  • Buffers are 1 cache line in size

  • Each instruction takes 2 cycles to execute

[2] J. Whitham and N. Audsley. The Scratchpad Memory Management Unit for Microblaze: Implementation, Testing, and Case Study. Technical Report YCS-2009-439, University of York, 2009.


TickPAD - Overview

  • Spatial memory pipeline

    • Accelerates linear code

  • Associative loop memory

    • For predictable temporal locality

      • Statically allocated and dynamically loaded

  • Tick address queue

    • Stores the resumption addresses of active threads

  • Tick instruction buffer

    • Stores the instructions at the resumption point of the next active thread

    • Reduces context switching overhead at state/tick boundaries

  • Command table

    • Stores the commands to be executed by the TickPAD controller

  • Command buffer

    • Buffers operands fetched from main memory

    • For commands requiring two or more operands

Spatial Memory Pipeline

  • Cache – on a miss

    • Fetches from main memory into the cache

      • The first instruction misses; subsequent instructions on that line hit

    • Timing analysis requires the cache history

  • Scratchpad – unallocated code

    • Executes from main memory

      • Miss cost for every instruction

    • Simple timing analysis


Spatial Memory Pipeline

  • Memory controller

    • Single line buffer

  • Simple analysis

    • Analyse the previous instruction

      • The first instruction misses; subsequent instructions on that line hit

[Diagram: CPU – line buffer – Main Memory]


Spatial Memory Pipeline

  • Computations require many lines of instructions

  • Exploit spatial locality

    • Predictably prefetch the next line of instructions

    • Add another buffer


Spatial Memory Pipeline

  • To preserve determinism

    • Prefetch is active only when the current line has no branch




Spatial Memory Pipeline

  • Timing analysis

    • Simple to analyse

    • Analyse the next instruction line

      • If the current line has a branch, the next target line will miss

        • e.g. 38 clock cycles

      • Else it will have been prefetched

        • e.g. 38 – 8 = 30 clock cycles
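Under the stated assumptions (4 instructions per line at 2 cycles each, 38-cycle miss), this per-line rule can be written down directly; a minimal sketch:

```c
#include <assert.h>

/* The next-line fetch cost rule from the slide, with the deck's
 * numbers: a miss costs 38 cycles; when prefetching is active,
 * the fetch overlaps the 4 x 2 = 8 cycles spent executing the
 * current line. */
enum { MISS_CYCLES = 38, INSTRS_PER_LINE = 4, CYCLES_PER_INSTR = 2 };

int next_line_cost(int line_has_branch) {
    if (line_has_branch)
        return MISS_CYCLES;     /* prefetch disabled: full miss */
    return MISS_CYCLES - INSTRS_PER_LINE * CYCLES_PER_INSTR;  /* 38 - 8 = 30 */
}
```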


Spatial memory pipeline15

Spatial Memory Pipeline

  • Timing analysis

    • Simple to analyse

    • Analysis next instruction line

      • If has a branch next target line will miss

        • e.g. 38 clock cycles

      • Else – will be prefetched

        • e.g. 38 – 8 = 30 clock cycles


Spatial memory pipeline16

Spatial Memory Pipeline

  • Timing analysis

    • Simple to analyse

    • Analysis next instruction line

      • If has a branch next target line will miss

        • e.g. 38 clock cycles

      • Else – will be prefetched

        • e.g. 38 – 8 = 30 clock cycles


Tick Address Queue / Tick Instruction Buffer

  • Reduces the cost of context switching

  • Maintains a priority queue

    • Thread execution order

  • Prefetches instructions from the next thread

  • Makes context switching points appear as linear code

    • Paired with the Spatial Memory Pipeline
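The deck does not give the hardware structure in detail; one way to model the queue of resumption addresses is a small FIFO kept in thread-priority order. The names and capacity below are illustrative assumptions:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the tick address queue: a FIFO of thread
 * resumption addresses.  At each context switch the controller
 * pops the head and prefetches the line at that address into the
 * tick instruction buffer. */
#define TQ_CAP 8

typedef struct {
    uint32_t addr[TQ_CAP];
    int head, tail, len;
} tick_queue_t;

void tq_push(tick_queue_t *q, uint32_t resume_addr) {
    if (q->len == TQ_CAP) return;       /* full: drop (model only) */
    q->addr[q->tail] = resume_addr;
    q->tail = (q->tail + 1) % TQ_CAP;
    q->len++;
}

/* Next thread's resumption address, or 0 when the queue is empty. */
uint32_t tq_pop(tick_queue_t *q) {
    if (q->len == 0) return 0;
    uint32_t a = q->addr[q->head];
    q->head = (q->head + 1) % TQ_CAP;
    q->len--;
    return a;
}
```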


Tick Address Queue / Tick Instruction Buffer

Context switching – memory cost is the same as for linear code


Tick Address Queue / Tick Instruction Buffer

  • Timing analysis

    • The same prefetch cost applies at allocated context switching points


Associative Loop Memory

  • Statically Allocated

    • Greedy

      • Allocates innermost loops first

  • Fetches loops before executing

    • Predictable – easy and tight to model

    • Exploits temporal locality
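A greedy allocator of the kind described might look as follows; the loop sizes, capacity, and field names are illustrative assumptions, not taken from the deck:

```c
#include <assert.h>

/* Greedy allocation sketch: repeatedly pick the deepest-nested
 * unallocated loop that still fits in the loop memory, so that
 * innermost loops are allocated first. */
typedef struct { int depth; int size; int allocated; } loop_t;

int allocate_loops(loop_t *loops, int n, int capacity) {
    int used = 0, progress = 1;
    while (progress) {
        progress = 0;
        int best = -1;
        for (int i = 0; i < n; i++)
            if (!loops[i].allocated &&
                loops[i].size <= capacity - used &&
                (best < 0 || loops[i].depth > loops[best].depth))
                best = i;
        if (best >= 0) {
            loops[best].allocated = 1;
            used += loops[best].size;
            progress = 1;
        }
    }
    return used;    /* amount of loop memory consumed */
}
```

With a capacity of 64 and loops of depth 1/3/2 and size 40/24/32, the depth-3 and depth-2 loops are allocated (56 units) and the outermost loop no longer fits.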


Command Table

  • Statically allocated

  • A look-up table used to dynamically load:

    • Tick Instruction Buffer

    • Tick Address Queue

    • Associative Loop Memory

  • Commands are executed when the PC matches the address stored in the entry

    • Allows the TickPAD to function without modification to source code

      • Libraries

      • Proprietary programs


Command Table

  • Three fields

    • Address

      • The PC address to execute the command

    • Command

      • Discard Loop Associative Memory

      • Store Loop Associative Memory

      • Fill Tick Instruction Buffer

      • Load Tick Address Queue

    • Operand

      • Data used by the command
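The three fields map naturally onto a table of entries matched against the PC. The encoding below is a hypothetical sketch (field widths and opcode names are assumptions, mirroring the commands listed above):

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Hypothetical command-table entry with the three fields above. */
typedef enum {
    CMD_DISCARD_LOOP_MEM,   /* Discard Loop Associative Memory */
    CMD_STORE_LOOP_MEM,     /* Store Loop Associative Memory   */
    CMD_FILL_TICK_BUFFER,   /* Fill Tick Instruction Buffer    */
    CMD_LOAD_TICK_QUEUE     /* Load Tick Address Queue         */
} tp_opcode_t;

typedef struct {
    uint32_t    addr;       /* PC value that triggers the command */
    tp_opcode_t cmd;        /* action for the TickPAD controller  */
    uint32_t    operand;    /* data used by the command           */
} tp_entry_t;

/* An entry fires when the PC matches its address, so the source
 * program itself needs no modification. */
const tp_entry_t *tp_match(const tp_entry_t *tbl, int n, uint32_t pc) {
    for (int i = 0; i < n; i++)
        if (tbl[i].addr == pc)
            return &tbl[i];
    return NULL;
}
```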


Command Table Allocation

[Figure: allocation steps]

Results

WCRT reduction compared with:

  • 8.5% – locked SPMs

  • 12.3% – thread-multiplexed SPM

  • 13.4% – direct-mapped caches



Results - Synthesis


Conclusion

  • Presented a new memory architecture

    • Tailored for synchronous programs

  • Has better worst case performance

  • Analysis time is scalable

    • Between scratchpad and abstract cache analysis

  • The presented architecture is also suitable for other synchronous languages

  • Future work

    • Data TickPAD

    • TickPAD on multicores


Thank You

