Transactional Memory Coherence and Consistency

Lance Hammond, Vicky Wong, Mike Chen, Brian D. Carlstrom, John D. Davis, Ben Hertzberg, Manohar K. Prabhu, Honggo Wijaya, Christos Kozyrakis, and Kunle Olukotun. Presented by Peter Gilbert, ECE259, Spring 2008.

Presentation Transcript


  1. Transactional Memory Coherence and Consistency • Lance Hammond, Vicky Wong, Mike Chen, Brian D. Carlstrom, John D. Davis, Ben Hertzberg, Manohar K. Prabhu, Honggo Wijaya, Christos Kozyrakis, and Kunle Olukotun • Presented by Peter Gilbert, ECE259, Spring 2008

  2. Motivation • Shared memory can support simple programming models • However, shared memory means complex consistency and cache coherence protocols • Consistency - must provide rules for ordering loads/stores • Tradeoff between ease of use and performance (e.g. SC vs. RC) • Cache coherence - must track ownership of cache lines • Requires low latency for many small coherence messages

  3. Motivation • Latency of small coherence messages unlikely to scale well • Interprocessor bandwidth likely to scale better • Message passing models can take advantage, but hard to program • Can we design a shared memory model with: • Simple programming model • Simple hardware • Good performance • ...and take advantage of available bandwidth?

  4. TCC • Transaction - sequence of instructions that executes speculatively on a processor and completes only as an atomic unit • All writes are buffered locally and only committed to shared memory when the transaction completes • Commit is atomic: entire effect of transaction committed to shared memory state at once • Transaction’s effects not visible to other processors until commit is performed • Processor broadcasts entire write buffer in one large commit packet • High broadcast bandwidth needed, latency not as important • Interconnect need not provide ordering • System-wide view: transactions appear to execute in commit order
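
The buffering and atomic commit described on this slide can be sketched in a few lines. This is a minimal software model, not the hardware design: the names `SharedMemory`, `Transaction`, `load`, `store`, and `commit` are assumptions made for illustration, and a Python dict stands in for shared memory.

```python
class SharedMemory:
    """Toy shared memory: a dict from address to value (illustrative only)."""

    def __init__(self):
        self.data = {}
        self.commit_log = []  # order in which commit packets were applied

    def commit(self, txn_id, write_buffer):
        # The whole write buffer is applied as one packet, so other
        # processors see either none or all of the transaction's writes.
        self.data.update(write_buffer)
        self.commit_log.append(txn_id)


class Transaction:
    def __init__(self, txn_id, memory):
        self.txn_id = txn_id
        self.memory = memory
        self.write_buffer = {}

    def load(self, addr):
        # A transaction sees its own buffered writes first.
        if addr in self.write_buffer:
            return self.write_buffer[addr]
        return self.memory.data.get(addr, 0)

    def store(self, addr, value):
        # Writes stay local until commit.
        self.write_buffer[addr] = value

    def commit(self):
        self.memory.commit(self.txn_id, self.write_buffer)
        self.write_buffer = {}


mem = SharedMemory()
t1 = Transaction("t1", mem)
t1.store("x", 1)
t1.store("y", 2)
visible_before = dict(mem.data)  # nothing is visible before commit
t1.commit()
visible_after = dict(mem.data)   # both writes become visible together
```

The system-wide order of transactions is simply the order of entries in `commit_log`, which mirrors the "transactions appear to execute in commit order" view.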

  5. Dependence violations • Each processor must snoop on commit packets to detect violations • A transaction that detects a violation must roll back and restart • Register checkpointing mechanism needed
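
Violation detection by snooping can be sketched as a set intersection between a processor's read set and the addresses in a commit packet. The `Processor` class and the `commit` helper below are hypothetical names; register checkpointing is not modeled, so rollback only discards the speculative memory state and counts the restart.

```python
class Processor:
    def __init__(self, name):
        self.name = name
        self.read_set = set()   # shared addresses read by the open transaction
        self.write_buffer = {}  # speculative writes, local until commit
        self.restarts = 0

    def load(self, addr, memory):
        if addr in self.write_buffer:
            return self.write_buffer[addr]
        self.read_set.add(addr)
        return memory.get(addr, 0)

    def store(self, addr, value):
        self.write_buffer[addr] = value

    def snoop(self, committed_addrs):
        # Violation: another processor committed a write to data we already read.
        if self.read_set & committed_addrs:
            self.rollback()
            return True
        return False

    def rollback(self):
        self.read_set.clear()
        self.write_buffer.clear()
        self.restarts += 1


def commit(committer, others, memory):
    """Apply the committer's packet and let every other processor snoop it."""
    packet = dict(committer.write_buffer)
    memory.update(packet)
    violated = [p.name for p in others if p.snoop(set(packet))]
    committer.read_set.clear()
    committer.write_buffer.clear()
    return violated


memory = {"x": 0, "y": 0}
p0, p1, p2 = Processor("p0"), Processor("p1"), Processor("p2")
p1.load("x", memory)  # p1 has read x
p2.load("y", memory)  # p2 has read only y
p0.store("x", 5)
violated = commit(p0, [p1, p2], memory)  # only p1 must restart
```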

  6. Consistency • Consistency is simplified: • Other consistency models: ordering rules between individual memory references • TCC: sequential ordering only between transaction commits • All references from an earlier commit appear to occur “before” all references from later commit • Interleaving between memory accesses by different processors only allowed at transaction boundaries • Can provide illusion of uniprocessor execution by imposing original program’s transaction order

  7. Coherence • Coherence is simplified: • No ownership of cache lines • Writes are buffered • Invalidation or update only occurs by snooping commit packets • Don’t rely on many latency-sensitive coherence messages

  8. Antidependencies (figure panels: WAW, WAR) • TCC automatically handles WAW and WAR dependences, because writes are buffered locally until commit
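
A small sketch of why buffering is enough for these two cases, assuming a dict as shared memory and hypothetical transactions T1 to T4 whose commits happen in the order shown:

```python
memory = {"x": 10}

# WAR: T1 (earlier) reads x, T2 (later) writes x.
t2_buffer = {"x": 99}     # T2's write stays in its local buffer
t1_read = memory["x"]     # T1 still sees the old value
memory.update(t2_buffer)  # T2 commits after T1 has finished

# WAW: T3 and T4 both write x; commit order decides the final value.
t3_buffer = {"x": 1}
t4_buffer = {"x": 2}
memory.update(t3_buffer)  # T3 commits first
memory.update(t4_buffer)  # T4 commits second and overwrites T3's value
```

In neither case does a transaction read data that an earlier commit changed, so no rollback is needed.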

  9. Programming model • Programmer inserts transaction boundaries • Similar to threading, but no locks, so fewer errors • One hard rule: transaction breaks cannot be placed between a load and a subsequent store of a shared value • Steps for parallelizing code with TCC: • Divide into potentially parallel transactions • Examples: loop iterations, after function calls • Transactions need not be independent • Dependence violations caught at runtime • Specify order • Tune performance
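
The hard rule can be illustrated with a shared counter. The sketch below is an assumed, simplified model (the function `increment_twice` and its fixed interleaving are hypothetical): two processors A and B both load the counter, then A commits its store, then B commits its store. When the load and store sit in the same transaction, B's read is still tracked when A commits, so B detects the violation and re-executes. When a boundary separates them, the read is no longer tracked and an update is lost.

```python
def increment_twice(break_between_load_and_store):
    """Two processors each increment a shared counter once.

    Interleaving modeled: both load, then A commits, then B commits.
    """
    memory = {"counter": 0}
    a_val = memory["counter"]
    b_val = memory["counter"]
    # B's read set: shared addresses loaded by B's open transaction.
    b_reads = {"counter"}
    if break_between_load_and_store:
        # A boundary after the load ends the transaction that did the read,
        # so the read is no longer tracked when the store commits later.
        b_reads = set()
    memory["counter"] = a_val + 1  # A commits its store
    if "counter" in b_reads:
        b_val = memory["counter"]  # B saw a violation and re-executed its load
    memory["counter"] = b_val + 1  # B commits its store
    return memory["counter"]


correct = increment_twice(False)  # load and store in one transaction
broken = increment_twice(True)    # boundary between load and store
```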

  10. Transaction Ordering • Most programs require ordering between certain transactions • Solution: assign phase number to each transaction • Only transactions from the “oldest” phase can commit • Can implement barriers or full ordering
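
Phase-based ordering can be sketched as a commit arbiter that only grants requests from the oldest phase still pending. The function name `commit_order` and the tie-break (first requester within the oldest phase wins) are assumptions for illustration; the slide does not specify how ties are resolved.

```python
def commit_order(pending):
    """Return the order in which transactions are allowed to commit.

    pending: list of (name, phase) in the order commit was requested.
    Only transactions in the oldest (smallest) pending phase may commit.
    """
    pending = list(pending)
    order = []
    while pending:
        oldest = min(phase for _, phase in pending)
        for i, (name, phase) in enumerate(pending):
            if phase == oldest:
                order.append(name)
                del pending[i]
                break
    return order


# Barrier: a and b share phase 1, c is in phase 2 and must wait for both.
barrier = commit_order([("c", 2), ("a", 1), ("b", 1)])

# Full ordering: every transaction gets its own phase number.
full = commit_order([("t3", 3), ("t1", 1), ("t2", 2)])
```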

  11. Performance tuning • How to choose transactions: • Large transactions amortize startup and commit overhead • Smaller transactions should be used when violations are frequent • Minimize amount of lost work • TCC system provides feedback about violations to facilitate tuning

  12. Hardware requirements • Write buffer • Read bit(s), modified bit, optional renamed bits for L1 cache lines • System wide commit arbitration
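
A rough model of the per-line state bits, assuming one read bit and one modified bit per L1 line and leaving out the optional renamed bits. The class and method names (`CacheLine`, `L1Cache`, `snoop_violates`, `end_transaction`) are hypothetical, and lines are identified by an integer tag only.

```python
from dataclasses import dataclass


@dataclass
class CacheLine:
    tag: int
    read: bool = False      # line was read by the open transaction
    modified: bool = False  # line was written by the open transaction


class L1Cache:
    def __init__(self):
        self.lines = {}

    def _line(self, tag):
        return self.lines.setdefault(tag, CacheLine(tag))

    def load(self, tag):
        # Without renamed bits, every load conservatively sets the read bit.
        self._line(tag).read = True

    def store(self, tag):
        self._line(tag).modified = True

    def snoop_violates(self, committed_tags):
        # A commit packet violates us if it touches a line whose read bit is set.
        return any(t in self.lines and self.lines[t].read for t in committed_tags)

    def end_transaction(self, rolled_back):
        if rolled_back:
            # Speculatively modified lines are discarded on rollback.
            self.lines = {t: l for t, l in self.lines.items() if not l.modified}
        for line in self.lines.values():
            line.read = False
            line.modified = False


cache = L1Cache()
cache.load(0x40)
cache.store(0x80)
hit_on_read_line = cache.snoop_violates({0x40})
hit_on_written_line = cache.snoop_violates({0x80})
cache.end_transaction(rolled_back=True)
```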

  13. Extensions • Double buffering • Allow a processor to work on next transaction while previous transaction waits to commit • Use additional write buffers and sets of read and modified bits • (Figure panels: without double buffering; extra write buffer; extra write buffer and read bits)

  14. Extensions • Hardware-controlled transactions • Automatically divide program into transactions as buffers overflow • Take full advantage of available buffer space • Programmer must still mark critical regions where transaction boundaries cannot occur • Automatically merge small transactions into larger ones • I/O • Must guarantee no rollback after input is read • Obtain commit permission before reading input • If ordering of outputs is important • Same idea: request commit permission immediately
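
The I/O rule (no rollback after input is read) can be sketched with a single commit-permission token. `CommitArbiter` and `read_input` are hypothetical names; the point is only that input is consumed after permission is held, at which point the transaction can no longer be violated by another commit.

```python
class CommitArbiter:
    """Grants commit permission to one transaction at a time."""

    def __init__(self):
        self.holder = None

    def request(self, txn):
        if self.holder is None:
            self.holder = txn
        return self.holder == txn

    def release(self, txn):
        if self.holder == txn:
            self.holder = None


def read_input(txn, arbiter, source):
    # Do not consume input while the transaction could still be rolled back.
    if not arbiter.request(txn):
        return None  # caller must wait and retry
    return source.pop(0)


arbiter = CommitArbiter()
source = ["a", "b"]
first = read_input("t1", arbiter, source)    # t1 gets permission and reads
blocked = read_input("t2", arbiter, source)  # t2 must wait, input untouched
arbiter.release("t1")                        # t1 has committed
second = read_input("t2", arbiter, source)
```

The same pattern applies to ordered output: request permission first, then emit.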

  15. Evaluation • Can TCC extract parallelism for shared memory benchmarks? • How large are the read and write states which must be buffered? • What is the broadcast bandwidth requirement?

  16. Simulation results • Optimal TCC model (infinite bus bandwidth, no memory delays) extracted parallelism well for many benchmarks • (Figure: speedups for automatically and manually parallelized benchmarks with optimal TCC model)

  17. Read and write state • Most benchmarks needed 6-12 KB of read state and 4-8 KB of write state • Reasonable for current caches and on-chip write buffers • Most of the benchmarks requiring large read and write state can probably be divided into smaller transactions (e.g. radix_l vs. radix_s) • (Figure: write state for smallest 10%, 50%, and 90% of iterations)

  18. Broadcast bandwidth • If invalidate protocol is used with 32-bit addresses, average of 0.5 bytes/cycle for 32 processors • For update protocol, up to 16 bytes/cycle for 32 processors • If only dirty data is sent, only 8 bytes/cycle • (Figure: average bytes/cycle broadcast by 1 IPC system with an update protocol)
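
To put the bytes/cycle figures in more familiar units, they can be multiplied by a clock frequency. The 1 GHz clock below is purely an assumption for illustration; the slides do not state a clock rate.

```python
def to_gb_per_s(bytes_per_cycle, clock_hz):
    """Convert a per-cycle broadcast rate into GB/s (1 GB = 1e9 bytes)."""
    return bytes_per_cycle * clock_hz / 1e9


ASSUMED_CLOCK_HZ = 1e9  # assumption: 1 GHz, not a figure from the slides

slide_figures = {
    "invalidate (average)": 0.5,
    "update (peak)": 16,
    "update, dirty data only": 8,
}

rates = {name: to_gb_per_s(b, ASSUMED_CLOCK_HZ) for name, b in slide_figures.items()}
```

Under that assumed clock, the update protocol peaks at 16 GB/s, 32 times the invalidate average, which halves to 8 GB/s when only dirty data is sent.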

  19. Other parameters • Snooping requirement: significantly less than 1 address/cycle • Single snoop port per processor sufficient for up to 32 processors • Commit arbitration overhead: compiler-parallelized apps were insensitive, while performance suffered for apps with smaller transaction sizes • Extensions did not yield benefits in most cases • Extra read state bits (per-word rather than per-line) only mattered for a few applications • Double buffering did not help in these simulations, but should be useful when bandwidth is limited

  20. Conclusions • TCC simplifies consistency and coherence • No need for rules for ordering individual memory references • No need for latency-sensitive coherence messages • TCC provides a simple and flexible programming model • Correctness is guaranteed: no error-prone locks • Tuning performance based on observed violations is straightforward • Uniprocessor ordering can be achieved by ordering all transactions • An optimal TCC implementation extracts parallelism well for a wide range of benchmarks • Performance? • Will be limited by broadcast bandwidth and commit arbitration overhead • Evaluation on a realistic hardware model necessary