
Enhancing Post-Silicon Processor Debug with Incremental Cache State Dumping


Presentation Transcript


  1. Enhancing Post-Silicon Processor Debug with Incremental Cache State Dumping Preeti Ranjan Panda, Anant Vishnoi, and M. Balakrishnan Proceedings of the IEEE 18th VLSI System on Chip Conference (VLSI-SoC 2010) Sept. 2010 Presenter: Chun-Hung Lai

  2. Abstract • During post-silicon validation/debug of processors, it is common to alternate between two phases: processor execution and state dump. The state dump, where the entire processor state is dumped off-chip to a logic analyzer for further processing, is a major bottleneck. We present a technique for improving debug efficiency by reducing the volume of cache data dumped off-chip, while still capturing the complete state. • The reduction is achieved by introducing hardware mechanisms to transmit only the portion of the cache that was updated since the last dump. We propose two design alternatives based on whether or not the processor is permitted to continue execution during the dump: Blocking Incremental Cache Dumping (BICD) and Non-blocking Incremental Cache Dumping (NICD). We observe a 64% reduction in overall cache lines dumped, and the dump time reduces to an average of 16.8% and 0.0002% for BICD and NICD respectively.

  3. What’s the Problem • The state dump is a major bottleneck during post-silicon debug of processors • The processor state must be dumped off-chip, and the last-level cache forms the majority of that state: a large cache means a large dump size and a long dump duration • To improve debug efficiency: reduce the volume of cache data dumped while still capturing the complete state

  4. Related Works • Design for debug • Scan-based debug for physical / logic probing [17][20]: halts real-time execution • Collection of selected signal traces • Trace compression [6][10][18]: reduces area overhead and dump time; decompression is performed off-line • Trace signal selection [9][11][13][15]: expands a few traced signals to restore the untraced signals • Compression of specific memory/cache data • For performance / energy: conservative compression [1][4][12][21], with decompression that does not impact μp execution • For debug: aggressive compression [14][18]; the achievable compression is limited • Iterative silicon debug with signatures [2]: captures only error data, zooming into the interval of the error signature; only for repeatable errors • Online cache dump for μp debug [19]: dumps simultaneously with μp execution • This paper: incremental cache state dumping, to reduce the dump size

  5. Incremental Cache Dumping • Goal: reduce the total amount of cache data to be transferred off-chip • Dump only the cache lines that were updated since the last dump, instead of dumping the whole cache every time • Use an Update History Table (UHT) to track all cache updates between two consecutive dumps (timeline: μp execution updates the cache; a full dump transfers every line, an incremental dump transfers only the updated ones)
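
A minimal software sketch of the UHT idea may help; NUM_LINES and the emit_line callback are illustrative assumptions, and the paper realizes this as a hardware table, not software:

    /* Update History Table (UHT) sketch: one bit per cache line, set on
     * every cache write and cleared once the line has been dumped. */
    #include <stdbool.h>

    #define NUM_LINES 1024                 /* assumed number of cache lines */

    static bool uht[NUM_LINES];            /* updated-since-last-dump bits */

    void on_cache_write(int line)          /* invoked on every cache update */
    {
        uht[line] = true;
    }

    void dump_state(void (*emit_line)(int))
    {
        for (int line = 0; line < NUM_LINES; line++) {
            if (uht[line]) {               /* transmit only updated lines */
                emit_line(line);
                uht[line] = false;         /* reset history for the next interval */
            }
        }
    }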

  6. Two Methodologies for Incremental Cache Dumping – 1st BICD • Blocking Incremental Cache Dumping (BICD) • The processor is halted during the cache dump • Dump the lines whose UHT entry is set • Cost vs. dump-time trade-off: each UHT bit can represent more than one cache line (a window), which shrinks the UHT but may lead to extra dumps of lines that were not updated • In the slide’s example, BICD reduces the dump size by 56%
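
A sketch of the BICD window trade-off, under the same illustrative assumptions, where one UHT bit covers WINDOW_SIZE consecutive lines:

    /* BICD window sketch: a smaller UHT is traded against possible extra
     * dumps, because an update to any line marks its whole window.
     * NUM_LINES, WINDOW_SIZE and emit_line are illustrative assumptions. */
    #include <stdbool.h>

    #define NUM_LINES   1024
    #define WINDOW_SIZE 4                          /* cache lines per UHT bit */
    #define NUM_WINDOWS (NUM_LINES / WINDOW_SIZE)

    static bool uht_win[NUM_WINDOWS];

    void on_cache_write(int line)
    {
        uht_win[line / WINDOW_SIZE] = true;        /* mark the whole window */
    }

    void dump_state(void (*emit_line)(int))        /* processor halted (blocking) */
    {
        for (int w = 0; w < NUM_WINDOWS; w++) {
            if (!uht_win[w])
                continue;
            /* every line of a marked window is dumped, including lines that
             * were never written: this is the "extra dump" overhead */
            for (int i = 0; i < WINDOW_SIZE; i++)
                emit_line(w * WINDOW_SIZE + i);
            uht_win[w] = false;
        }
    }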

  7. Two Methodologies for Incremental Cache Dumping – 2nd NICD • Non-Blocking Incremental Cache Dumping (NICD) • The cache dump is performed simultaneously with μp execution • Two challenges with NICD • (1) The cache state may be corrupted by the executing processor before a line is dumped • Solution: dump a cache line before the cache attempts to update it, and reset the corresponding UHT entry after dumping (the slide’s figure shows UHT entries marking dumped, non-dumped, and updated lines during the dump)

  8. Two Methodologies for Incremental Cache Dumping – 2nd NICD (Cont.) • Two challenges with NICD • (2) Maintenance of the Update History Table (UHT): a single UHT would be incorrectly updated by both the ongoing cache dump and the executing μp • Solution: use two UHTs • UHT-P (previous): records cache updates since the last dump and indicates which lines to dump • UHT-C (current): records cache updates made during the dump interval, without affecting the current dump • The two tables swap roles at the start of the next dump (the slide’s figure shows UHT snapshots at times T, T+1, T+2 illustrating “dump before update” and “update without affecting the current dump”); a software sketch follows
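
A software sketch of the dual-UHT scheme, combining both NICD solutions, under the same illustrative assumptions as the sketches above (real hardware would switch table roles rather than copy entries):

    /* NICD sketch with two UHTs: uht_p (UHT-P) marks the lines to dump in
     * the current pass, uht_c (UHT-C) collects updates made while the dump
     * runs. A write to a line still pending in uht_p first dumps the old
     * contents, keeping the off-chip image consistent.
     * NUM_LINES and emit_line are illustrative assumptions. */
    #include <stdbool.h>

    #define NUM_LINES 1024

    static bool uht_p[NUM_LINES];      /* UHT-P: updates since the last dump */
    static bool uht_c[NUM_LINES];      /* UHT-C: updates during this dump */

    void on_cache_write(int line, void (*emit_line)(int))
    {
        if (uht_p[line]) {             /* challenge (1): dump before update */
            emit_line(line);
            uht_p[line] = false;
        }
        uht_c[line] = true;            /* challenge (2): record the update in
                                          UHT-C; the current dump is unaffected */
    }

    void dump_pass(void (*emit_line)(int))    /* runs alongside μp execution */
    {
        for (int line = 0; line < NUM_LINES; line++) {
            if (uht_p[line]) {
                emit_line(line);
                uht_p[line] = false;
            }
        }
        for (int line = 0; line < NUM_LINES; line++) {
            uht_p[line] = uht_c[line];         /* swap roles for the next dump */
            uht_c[line] = false;
        }
    }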

  9. Illustration of Non-Blocking Incremental Cache Dumping (NICD) – 1 • UHT-P indicates the lines to be dumped; each line is dumped and its UHT-P entry is then reset • Example: line F is updated while line B is being dumped • F is dumped first and its UHT-P entry is reset • F is then updated and its UHT-C entry is set

  10. Illustration of Non-Blocking Incremental Cache Dumping (NICD) – 2 • Lines C and H are updated during the dump, but this does not affect the current dump: their updates are recorded in UHT-C only • When the dump pass reaches them, entries with UHT-P = ‘0’ are not dumped; F is also skipped (its UHT-P entry reads ‘0’) since it has already been dumped due to its update • Ready for the next dump: the tables swap roles, with UHT-P capturing further updates and UHT-C indicating the lines to be dumped

  11. Hardware Implementation – NICD Architecture • Counter: holds the index of the line being dumped • Mask: holds the address of the updated window • Two UHTs track cache updates: one is used for updates, the other for the dump • Signals exported from the cache: W_sel (the updated way), Write (a cache update), Dump (line ready for dump)

  12. Hardware Implementation – Operation Flow • (1) Dump_S starts the dump • (2) When Valid & Dump are sensed, the data lines are moved to the buffer (the Valid signal comes from the UHT, the way from W_sel) • (3) On a cache update (Write): if the line’s UHT entry is ‘1’, the line is dumped in advance, before the update

  13. Experimental Results – Lines Dumped at Various Dump Intervals / Window Sizes • For CHESS: the number of lines dumped increases with both the window size and the dump interval • For HMMER: the difference with respect to window size is minimal; lines dumped still increase with the dump interval • For window size 1, only 36% of the total lines are dumped on average

  14. Experimental Results – Processor Stalls with NICD • A cache update during the dumping of a window causes a stall; stalls increase with the window size • For CHESS: an average 0.0005% stall overhead for window size 2 • For HMMER: an average 0.0001% stall overhead for window size 2 • Memory requests are spread over time, with infrequent updates

  15. Experimental Results – Dump Time Overhead for NICD • Total dump time overhead = processor stall overhead + dumping overhead (the bus is busy during the dump), expressed as a percentage of the original dump time • For CHESS: 0.0002% dump time for all dump intervals (window size 1) • For HMMER: 0.0009% ~ 0.003% dump time for all dump intervals (window size 1) • The overall dump time follows the trend of the processor stalls: it increases with the window size

  16. Experimental Results – Area / Access Time • Additional area / timing overhead, measured with a 180 nm synthesis technology • For BICD: requires one UHT (window size varied between 1 and 16) • Area: 0.24 ~ 0.03 mm2 • Timing: no overhead (the UHT access time is smaller than the cache access time) • For NICD: requires two UHTs • Dump logic: area is twice that of BICD, with no extra timing overhead • Cache modification for online dumping: the area difference is 0.0002 mm2, with no extra timing overhead

  17. Conclusions • This paper proposed incremental cache state dumping • Goal: reduce the transfer time and the logic analyzer space requirement • Two hardware mechanisms: Blocking Incremental Cache Dumping (BICD) and Non-blocking Incremental Cache Dumping (NICD) • The results show that incremental dumping reduces the lines dumped by 64%, BICD reduces the dump time to 16.2% of the original dump time, and NICD reduces the dump time to 0.0002% of the original dump time

  18. Comments for This Paper • Good points • Shows how cache dumping can be used for debug • Signature-based debugging approach: maps a sequence of events into a cache state dump • Identifies the factors behind the dump time overhead • Things that could be improved • It is unclear why the dump line’s index is not fed into the UHT • From the architecture, the UHT appears to use a single-port SRAM; how are a “cache line dump” and a “normal cache access” achieved simultaneously? • The environment for transferring data from the dump logic to the logic analyzer is not described
