Predictable Programming on a Precision Timed Architecture

Predictable Programming on a Precision Timed Architecture Hiren D. Patel UC Berkeley hiren@eecs.berkeley.edu Joint work with: Ben Lickly, Isaac Liu, Edward A. Lee - UC Berkeley Sungjun Kim, Stephen A. Edwards - Columbia University

Edwards and Lee - Case for PRET Patel, UC Berkeley, PRET • 2007 – Edwards and Lee made a case for precision timed computers (PRET machines) • Predictability • Repeatability S. A. Edwards and E. A. Lee, The case for the precision timed (PRET) machine. In Proceedings of the 44th Annual Conference on Design Automation (San Diego, California, June 04 - 08, 2007). DAC '07. ACM, New York, NY, 264-265. 2

Edwards and Lee - Case for PRET Patel, UC Berkeley, PRET • Unpredictability • Difficulty in determining timing behavior through analysis • Non-repeatability • Lack of guarantee that every execution yields the same timing behavior • Brittleness • Small changes have big effects on timing behavior 3

Brittleness Source: www.skycontrol.net Patel, UC Berkeley, PRET Expensive affair Tight coupling of software and hardware Reliance on testing for validation Upgrading difficult Solution: stockpile 4

But wait … Sebastian Altmeyer, Christian Hümbert, Björn Lisper, and Reinhard Wilhelm. Parametric Timing Analysis for Complex Architectures. In Proceedings of the 14th IEEE International Conference on Embedded and Real-Time Computing Systems and Applications (RTCSA'08), pages 367-376, Kaohsiung, Taiwan, August 2008. IEEE Computer Society. Patel, UC Berkeley, PRET • Real-time scheduling • Worst-case execution time • Detailed model of hardware • Large engineering effort • Valid for particular hardware models • Interrupts, inter-process communication, locks … • Bench testing • Brittle 5

Precise Timing and High Performance Traditional Alternative Caches Scratchpads Deep out-of-order pipelines Thread-interleaved pipelines Function-only ISAs ISAs with timing instructions Function-only languages Languages and programming models with timing Best-effort communication Fixed-latency communication Time-sharing Multiple independent processors Patel, UC Berkeley, PRET 6

Outline Patel, UC Berkeley, PRET Introduction Related Work PRET Machine Programming Example Future Work Conclusion 7

Related Work Patel, UC Berkeley, PRET • Java Optimized Processor • Schoeberl et al. [2003] • Timing instructions • Ip and Edwards [2006] • Reactive processors • Von Hanxleden et al. [2005] • Salcic et al. [2005] • Virtual Simple Architecture • Mueller et al. [2003] 8

Semantics of Timing Instructions Deadline instructions Denote the required execution time of a block When decoded Stall instruction if timer value is not 0 Otherwise set timer value to new value deadi $t0, 10 … deadi $t0, 8 … deadi $t0, 0 … L0: … deadi $t0, 10 b L0 … Straight Line Block 0 Straight Line Block 1 Loop Block Patel, UC Berkeley, PRET 9

A: deadi $t0, 6 B: sethi %hi(0x3f800000), %g1 C: or %g1, 0x200, %g1 D: st %g1, [ %fp + -12 ] E: deadi $t0, 8 F: … 0 6 5 4 3 2 1 0 8 Tracing A Program Fragment cycle $t0 Patel, UC Berkeley, PRET

Precision Timed Architecture Scratchpad memories Round-robin thread scheduling Thread-interleaved pipeline Time-triggered main memory access Patel, UC Berkeley, PRET 11

Memory Hierarchy Core Main Mem. SPM SPM SPM SPM SPM SPM DMA Patel, UC Berkeley, PRET • Clocks • Main clock • Derived clocks • Instruction and data scratchpad memories • 1 cycle access latency • Main memory • 16MB size • Latency of 50ns • Frequency:250Mhz • ~13 cycles latency 12

Thread-interleaved Pipeline Decrement Deadline Timers Fetch F/D Decode D/R Stall if Deadline Instruction Reg. Access R/E Execute Check main memory access E/M Memory M/W Increment PC WriteBack Patel, UC Berkeley, PRET • Thread stalls • Main memory access • Multi-cycle operations • Deadline instructions • Replay mechanism • Execute same PC next iteration • Multi-cycle ALU ops replay instructions 13

Best-case access time If accessed 1st cycle Worst-case access time If accessed 2nd cycle of window Time-Triggered Access through Memory Wheel • Decouple thread’s access pattern • Time-triggered access 90 cycles until thread0 completes On time On time On time On time On time thread0 thread1 thread2 thread3 thread4 thread5 thread0 Patel, UC Berkeley, PRET 14

Tool Flow GCC 3.4.4, SystemC 2.2, Python 2.4 Boot code Motorola SREC files GCC to compile boot code and program code C programs timing instructions Patel, UC Berkeley, PRET 15

Simple Mutual Exclusion Example Write to output Write to shared data Read from shared data Patel, UC Berkeley, PRET • Producer followed by Consumer and Observer • Consumer and Observer execute together • Loop rate of two rotations of memory wheel • 1st for Producer to write • 2nd Consumer and Observer to read 16

Video Game Example Main-Control Thread Graphic Thread VGA-Driver Thread Pixel Data Command Even Buffer Even Queue Command Pixel Data Odd Buffer Odd Queue Swap (When Sync Requested and When Odd Queue Empty) Swap (When sync requested and when Vertical blank) Update Screen (Sync request) Refresh (Sync request) Sync (After queue swapped) Sync (After buffer swapped) Patel, UC Berkeley, PRET 17

Timing Requirements Signal Timing Requirement Pixel Cycles V. Sync 64µs 1611 V. Back-porch 1.02ms 25679 Draw 480 lines 15.25ms V. Front-porch 350µs 8811 H. Sync 3.77µs 96 H. Back-porch 1.89µs 48 Draw 640 pixels 25.42µs H. Front-porch 0.64µs 16 Patel, UC Berkeley, PRET 18

Timing Implementation Patel, UC Berkeley, PRET • Pixel-clock using derived clock • 25.175Mhz • ~ 39.72ns cycle period • Drawing 16 pixels 19

Future Work Architecture DMA DDR2 main memory model Thread synchronization primitives Shared data between threads Real-time Benchmarks With timing requirements Programming models Memory allocation schemes Synchronizations Patel, UC Berkeley, PRET 20

Conclusion What we want … Time as a first class citizen of embedded computing Predictability Repeatability Where we are at … PRET cycle-accurate simulator Release … Patel, UC Berkeley, PRET 21

Patel, UC Berkeley, PRET

Extras Patel, UC Berkeley, PRET

More on Brittleness • Small changes may have big effects on timing behavior Theorem (Richard’s anomalies): If a task set with fixed priorities, execution times, and precedence constraints is optimally scheduled on a fixed number of processors, then increasing the number of processors, reducing execution times, or weakening precedence constraints can increase the schedule length. Richard L. Graham, “Bounds on the performance of scheduling algorithms”, in E. G. Coffman, Jr.(ed.), Computer and Job-Shop Scheduling Theory, John Wiley, New York, 1975. Patel, UC Berkeley, PRET

T1/3 T2/2 T3/2 T4/2 1 2 3 4 9 5 6 7 8 T9/9 T5/4 T6/4 T7/4 T8/4 Richard’s Anomalies • 9 tasks, 3 processors, priority list, precedence order, execution times. 0 3 12 Patel, UC Berkeley, PRET

T1/2 T2/1 T3/1 T4/1 1 2 3 4 9 5 6 7 8 T9/8 T5/3 T6/3 T7/3 T8/3 Richard’s Anomalies: Reducing Execution Times • eTime’ = eTime - 1 0 3 12 Patel, UC Berkeley, PRET

T1/3 T2/2 T3/2 T4/2 1 2 3 4 9 5 6 7 8 T9/9 T5/4 T6/4 T7/4 T8/4 Richard’s Anomalies: More Processors • 4 processors 0 3 12 15 Patel, UC Berkeley, PRET

T1/3 T2/2 T3/2 T4/2 1 2 6 3 7 4 3 8 9 T9/9 T5/4 T6/4 T7/4 T8/4 Richard’s Anomalies: Changing Priority List • L = (T1,T2,T4,T5,T6,T3,T9,T7,T8) 0 3 12 Patel, UC Berkeley, PRET

Brittleness Again… • In general, all task scheduling strategies are brittle Patel, UC Berkeley, PRET

Predictable Programming on a Precision Timed Architecture