Static Bus Schedule aware Scratchpad Allocation in Multiprocessors

Sudipta Chattopadhyay and Abhik Roychoudhury, National University of Singapore

Presentation Transcript


  1. Static Bus Schedule aware Scratchpad Allocation in Multiprocessors. Sudipta Chattopadhyay, Abhik Roychoudhury. National University of Singapore

  2. Scratchpad Memory (Basics)
     • Scratchpad Memory
       - A fast, software-controlled on-chip memory
       - Each memory access is predictable
     • Problems
       - Cumbersome and error-prone if managed by the user
       - Needs extensive compiler support for automatic management

  3. Scratchpad Allocation
     [Figure: execution-time axis marking estimated, actual and observed BCET and WCET, with the over-estimation region labelled]
     WCET = worst-case execution time; BCET = best-case execution time
     • Worst-case optimization vs. average-case optimization
     • This work targets the worst case

  4. Scratchpad Allocation
     [Figure: allocation settings ranging from a single task with one SPM to task graphs with per-core scratchpads SPM-0 ... SPM-n]
     • Previous work in our group (Suhendra et al., RTSS'05)
     • Software cache locking (Puaut, ECRTS'07)
     • Previous work in our group (Suhendra et al., TOPLAS'10)
     • This paper: task graphs on SPM-0 ... SPM-n, with the next level of memory accessed over a shared bus

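  5. An MPSoC Architecture
     [Figure: an MPSoC with processing elements PE-0, PE-1, ..., PE-N and scratchpads SPM-0, SPM-1, ..., SPM-N, connected by a fast on-chip communication medium; an external memory interface reaches off-chip memory over a shared off-chip data bus]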

  6. SPM architecture
     • Bypassing memory hierarchy
       - Each memory access is predictable, which is crucial for time-predictable embedded systems
     • Non-bypassing memory hierarchy
       - Acts like a fully associative cache
       - Spilling and reloading of memory blocks lead to unpredictable execution time

  7. Allocation strategy
     • Consider data memory allocation
     • Variable locations (private SPM, remote SPM or external memory) are computed at compile time
     • If two variables share the same space, they are guaranteed to have disjoint lifetimes
     • No reloading cost is required

  8. Motivation
     • Why does a shared bus make it different?
     [Figure: bus schedule in which reference m must wait out the bus slots of the other cores (150 cycles), while reference m' is served inside this core's slot]
     • freq(m') > freq(m)
     • Total delay for ref(m) = (150 + LAT) * freq(m)
     • Total delay for ref(m') = LAT * freq(m')
     • If 150 * freq(m) > LAT * (freq(m') - freq(m)), allocating m is likely to reduce WCET more than allocating m'
     • The difference arises because the shared bus delay is variable
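The comparison on this slide can be checked numerically. The following is a minimal sketch, not the authors' tool: the function names and the value chosen for LAT are hypothetical, and only the 150-cycle wait and the two delay formulas come from the slide.

```python
def total_delay(freq, latency, bus_wait=0):
    """Total delay of a static memory reference executed `freq` times."""
    return (bus_wait + latency) * freq


def prefer_m(freq_m, freq_m_prime, latency, bus_wait=150):
    """True if moving m to the scratchpad removes more delay than moving m'.

    m pays the bus wait on every access; m' is served inside this core's slot.
    """
    return total_delay(freq_m, latency, bus_wait) > total_delay(freq_m_prime, latency)


LAT = 30  # hypothetical memory latency in cycles (not given on the slide)

# m is accessed 10 times, m' 40 times: (150 + 30) * 10 = 1800 > 30 * 40 = 1200
print(prefer_m(10, 40, LAT))   # True
# With m' accessed 100 times the frequency gap dominates: 1800 < 3000
print(prefer_m(10, 100, LAT))  # False
```

A purely frequency-based allocator would pick m' in both cases; the first case shows why the bus wait has to be part of the metric.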

  9. Motivation
     • Why does shared SPM space make it different?
     [Figure: the allocator's view of SPM-0 and SPM-1, with a not-so-critical task and a critical task]
     • An SPM allocator unaware of shared scratchpad space cannot allocate memory blocks accessed in critical tasks to SPM-1

  10. Motivation
     • Why does shared SPM space make it different?
     [Figure: the allocator's view of SPM-0 and SPM-1, with a not-so-critical task and a critical task]
     • By exploiting shared scratchpad space, more performance can be obtained, as the critical tasks can also allocate in SPM-1

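  11. Allocation framework
     [Figure: flow diagram of the framework]
     • Input: application task graph
     • Bus-delay aware WCET analysis: yields the task WCET and the total delay (bus delay + memory latency) to access variables along the WCEP (worst-case execution path)
     • WCRT analysis: yields variable lifetime and critical path information
     • Bus aware SPM allocator: yields the SPM allocation decision
     • Decision "Enough space?" (Yes/No branches); final output: optimized WCRT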

  13. Bus delay aware WCET analysis
     • A shared bus introduces variable latency for each memory access.
     • Our previous work approximates the total delay incurred by a static memory reference.
     • This delay is used as a metric by the greedy SPM allocator.
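To illustrate why the latency is variable, here is a rough sketch of the waiting time under a TDMA round-robin bus, the arbitration policy the summary slide names. It is a simplification and not the authors' analysis: `tdma_wait` is a hypothetical helper, slots are assumed to be equal in length and ordered by core index, and the sketch ignores whether an access can complete within the remaining part of a slot.

```python
def tdma_wait(t, core, num_cores, slot_len):
    """Cycles a request issued at time `t` by `core` waits for its own slot.

    Assumes core k owns the k-th slot of every round (hypothetical layout).
    """
    period = num_cores * slot_len
    # Position of t relative to the start of this core's most recent slot.
    offset = (t - core * slot_len) % period
    return 0 if offset < slot_len else period - offset


# Hypothetical setup: 4 cores, 50-cycle slots, so a round lasts 200 cycles.
print(tdma_wait(10, 0, 4, 50))   # 0: core 0 is inside its own slot
print(tdma_wait(50, 0, 4, 50))   # 150: core 0 just missed its slot
print(tdma_wait(0, 1, 4, 50))    # 50: core 1's slot starts at cycle 50
```

The same static reference can therefore cost anywhere between 0 and 150 extra cycles in this setup, depending on when it reaches the bus.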

  15. WCRT analysis
     [Figure: task graph with tasks t1, t2, t3 and t4; the number in parentheses is the assigned core: t1 (1), t2 (2), t3 (2), t4 (1)]
     • Task lifetime: [eStart, lFinish]
     • Earliest time computation:
       - eStart(t1) = 0
       - eStart(t4) >= eFinish(t2)
       - eStart(t4) >= eFinish(t3)
       - eFinish = eStart + BCET
     • Latest time computation:
       - lStart(t4) >= lFinish(t2)
       - lStart(t4) >= lFinish(t3)
       - t3 can be preempted by t2, so lFinish(t3) = lStart(t3) + WCET(t3) + WCET(t2) + 2 * BUS_SLOT_LENGTH
     • Computed WCRT = lFinish(t4)
     • All tasks have the same period: the period of the entire task graph
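The earliest and latest time computations can be sketched as one pass over the task graph in topological order. This is an illustration under stated assumptions, not the authors' implementation: the BCET/WCET values and the edges t1 -> t2 and t1 -> t3 are hypothetical, the preemption relation is given as input rather than derived from overlapping lifetimes, and the interference term simply generalizes the slide's formula for t3.

```python
BUS_SLOT_LENGTH = 50  # default slot length quoted on the evaluation slide

# name: (BCET, WCET, predecessors, tasks that may preempt it); values hypothetical
tasks = {
    "t1": (10, 20, [], []),
    "t2": (15, 30, ["t1"], []),
    "t3": (20, 40, ["t1"], ["t2"]),  # t3 can be preempted by t2 (same core)
    "t4": (5, 10, ["t2", "t3"], []),
}
order = ["t1", "t2", "t3", "t4"]  # a topological order of the task graph


def analyse(tasks, order):
    """Return earliest start and latest finish times for every task."""
    e_start, e_finish, l_start, l_finish = {}, {}, {}, {}
    for t in order:
        bcet, wcet, preds, preempters = tasks[t]
        # Earliest times: start after all predecessors finish at their earliest.
        e_start[t] = max((e_finish[p] for p in preds), default=0)
        e_finish[t] = e_start[t] + bcet
        # Latest times: start after all predecessors finish at their latest.
        l_start[t] = max((l_finish[p] for p in preds), default=0)
        # Each preempting task adds its WCET plus two bus slots, as for t3.
        interference = sum(tasks[p][1] + 2 * BUS_SLOT_LENGTH for p in preempters)
        l_finish[t] = l_start[t] + wcet + interference
    return e_start, l_finish


e_start, l_finish = analyse(tasks, order)
wcrt = l_finish["t4"]
print(e_start["t4"], l_finish["t3"], wcrt)  # 30 190 200
```

With these numbers the lifetime of t4 is [30, 200], and the computed WCRT is lFinish(t4) = 200 cycles.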

  17. Bus aware SPM allocator
     • Using WCRT analysis, we also obtain the lifetime information of each variable
     • Interference graph
       - Each node is a variable accessed in some task
       - An edge exists between two nodes if their lifetimes interfere
       - Nodes have weights: the higher the total access delay (including bus delay), the higher the weight; the weight is also higher if the variable is accessed on the critical path
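A small sketch of how such an interference graph could be built from variable lifetimes follows. The variable names, the lifetimes and in particular the weight formula (a multiplicative boost for critical-path variables) are assumptions made for illustration; the slide only states that weight grows with total access delay and with criticality.

```python
from itertools import combinations


def overlaps(a, b):
    """True if closed intervals a = (start, end) and b intersect."""
    return a[0] <= b[1] and b[0] <= a[1]


def build_interference_graph(lifetimes):
    """Adjacency sets: an edge joins two variables whose lifetimes interfere."""
    edges = {v: set() for v in lifetimes}
    for u, v in combinations(lifetimes, 2):
        if overlaps(lifetimes[u], lifetimes[v]):
            edges[u].add(v)
            edges[v].add(u)
    return edges


def weight(total_delay, on_critical_path, critical_boost=2.0):
    """Hypothetical node weight: access delay, boosted on the critical path."""
    return total_delay * (critical_boost if on_critical_path else 1.0)


# Hypothetical lifetimes in cycles.
lifetimes = {"x": (0, 100), "y": (50, 150), "z": (120, 200)}
graph = build_interference_graph(lifetimes)
print(sorted(graph["y"]))  # ['x', 'z']
print(sorted(graph["x"]))  # ['y']
```

Here x and z never live at the same time, so they share no edge and could occupy the same scratchpad space.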

  18. Bus aware SPM allocator
     [Figure: task T1 in PE-0 and task T2 in PE-1; node labels N = 10, N = 10, N = 5, C1 = 30, C2 = 55, C3 = 10, C4 = 10, C5 = 40, C6 = 55, M1 = 10, M2 = 10, M3 = 10]
     • Assume M1, M2 and M3 are the only memory accesses
     • Critical task = T1
     • M2 suffers more delay to access than M1
     • Reduce WCRT by reducing the WCRT of critical task T1

  19. Bus aware SPM allocator
     Allocation (iteration 1); task T1 in PE-0, task T2 in PE-1
     [Figure: task graphs of T1 and T2 with the same node labels as on slide 18]
     • Variable lifetimes: M1 [0,690], M2 [375,650], M3 [455,480]
     • SPM-0: M2; SPM-1: (empty)
     • Task finish times shown: t = 480 and t = 530; WCRT = 530 cycles
     • Critical task = T1: reduce WCRT by reducing the WCRT of critical task T1

  20. Bus aware SPM allocator
     Allocation (iteration 2); task T1 in PE-0, task T2 in PE-1
     [Figure: task graphs of T1 and T2 with the same node labels as on slide 18]
     • Variable lifetimes: M1 [0,530], M2 [375,510], M3 [455,480]
     • SPM-0: M2; SPM-1: M1
     • Task finish times shown: t = 464 and t = 480; WCRT = 480 cycles
     • Critical task = T2

  21. Bus aware SPM allocator
     Allocation (iteration 3); task T1 in PE-0, task T2 in PE-1
     [Figure: task graphs of T1 and T2 with the same node labels as on slide 18]
     • Variable lifetimes: M1 [0,464], M2 [405,450], M3 [455,480]
     • SPM-0: (M2, M3); SPM-1: M1
     • Task finish times shown: t = 463 and t = 464; WCRT = 464 cycles
     • M2 and M3 have disjoint lifetimes, so they are allocated the same space
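The space sharing in iteration 3 can be reproduced with a small greedy placement sketch. It uses the lifetimes from iteration 3 and the order in which the slides allocate the variables (M2, then M1, then M3). Everything else is a simplifying assumption: each SPM is modelled as a single variable-sized region, `allocate` is a hypothetical helper, and the re-analysis of WCET and WCRT between iterations is omitted.

```python
def disjoint(a, b):
    """True if closed intervals a = (start, end) and b never overlap."""
    return a[1] < b[0] or b[1] < a[0]


def allocate(order, lifetimes, num_spms):
    """Place each variable in the first SPM region it can share.

    A variable may join a region only if its lifetime is disjoint from the
    lifetime of every variable already placed there.
    """
    regions = [[] for _ in range(num_spms)]
    spilled = []  # variables left in external memory
    for v in order:
        for region in regions:
            if all(disjoint(lifetimes[v], lifetimes[u]) for u in region):
                region.append(v)
                break
        else:
            spilled.append(v)
    return regions, spilled


# Lifetimes from iteration 3 of the example.
lifetimes = {"M1": (0, 464), "M2": (405, 450), "M3": (455, 480)}
regions, spilled = allocate(["M2", "M1", "M3"], lifetimes, num_spms=2)
print(regions)  # [['M2', 'M3'], ['M1']]
print(spilled)  # []
```

M1 overlaps M2 and therefore goes to the second scratchpad, while M3 starts after M2 ends and reuses its space, matching the final allocation on the slide.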

  22. Experimental evaluation
     • Two real-world applications
       - An unmanned aerial vehicle (UAV) controller (papabench)
       - A fragment of in-orbit spacecraft software (Debie)
     • Compare WCRT improvement for different
       - SPM sizes (default: 5% of total data size)
       - Bus slot lengths (default: 50 cycles to each core)
       - Remote SPM latencies (default: 4 cycles)
     • Compare improvement with bus unaware SPM allocation

  23. WCRT improvement w.r.t. SPM size
     [Figure: WCRT improvement plotted against SPM size for MIS and NOBUS]
     • MIS = SPM allocation using our framework
     • NOBUS = bus unaware SPM allocator
     • Improvement over bus unaware allocator = 50%

  24. WCRT improvement w.r.t. bus slot length
     [Figure: WCRT improvement plotted against bus slot length]
     • Average improvement = 52%

  25. Worst case vs average case
     • SIM(MIS): average-case improvement
     • MIS: worst-case improvement

  27. Allocation framework
     [Figure: flow diagram of the framework]
     • Input: application task graph
     • Analysis of different bus arbitration policies
     • Bus-delay aware WCET analysis: yields the task WCET and the total delay (bus delay + memory latency) to access variables along the WCEP (worst-case execution path)
     • WCRT analysis: yields variable lifetime and critical path information
     • Bus aware SPM allocator: yields the SPM allocation decision
     • Decision "Enough space?" (Yes/No branches); final output: optimized WCRT

  28. Summary
     • We have proposed an SPM allocation framework for MPSoCs
     • Our goal is to reduce the worst-case response time (WCRT) of an application
     • We consider variable bus delays in SPM allocation
     • Currently we model a TDMA bus only, but the SPM allocation framework can be used with other bus arbitration policies
