Improving Multiprocessor Performance with Coarse-Grain Coherence Tracking

Jason F. Cantin, Mikko H. Lipasti, and James E. Smith. International Symposium on Computer Architecture, June 7th, 2005.


Presentation Transcript


  1. Improving Multiprocessor Performance with Coarse-Grain Coherence Tracking
     Jason F. Cantin, Mikko H. Lipasti, and James E. Smith
     International Symposium on Computer Architecture, June 7th, 2005

  2. Overview of Idea
     Coarse-Grain Coherence Tracking:
     • Monitors coherence status of memory at a multi-line granularity
     • Uses the coarse-grain information to identify requests that don’t need a coherence broadcast
     • Sends these requests directly to memory
     ISCA 2005

  3. Problem
     Snoop-based systems support a limited number of processors
     • Limited broadcast bandwidth
     • Increasing memory latency
     [Figure: system diagram with a broadcast network and a data network connecting processors (P), NC, $, MC, and DRAM]

  4. Opportunity
     • Some data requests don’t need a broadcast
       • Requests for non-shared data
       • Fetches of unmodified instructions
       • Write-backs
     • Some non-data requests don’t need to leave the processor
       • Requests to upgrade copy, but not shared
       • Requests to flush copies, but not cached elsewhere

  5. Unnecessary Broadcasts

  6. Our Approach
     • Identify requests that don’t need a broadcast
     • Send data requests directly to memory
       • Reduce broadcast traffic
       • Reduce latency in some systems
     • Avoid sending non-data requests externally
       • Further reduce broadcast traffic
       • Reduce latency

  7. Coarse-Grain Coherence Tracking
     • Memory is divided into coarse-grain regions
       • Aligned, power-of-two multiple of cache line size
       • Can range from two lines to a physical page
     • A cache-like structure is added to each processor for monitoring coherence at the granularity of regions
       • Region Coherence Array (RCA)
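The region layout on this slide can be sketched in a few lines of Python. This is only an illustration: the 64-byte line size is an assumption (the talk does not state one), the 512-byte region size is borrowed from the "Remaining Opportunity" slide, and all function names are hypothetical.

```python
LINE_SIZE = 64      # bytes per cache line (assumed; not given in the talk)
REGION_SIZE = 512   # bytes per region (512B regions are mentioned later in the talk)


def is_power_of_two(n: int) -> bool:
    """True for 1, 2, 4, 8, ..."""
    return n > 0 and (n & (n - 1)) == 0


def lines_per_region(region_size: int = REGION_SIZE, line_size: int = LINE_SIZE) -> int:
    """Number of cache lines covered by one region (at least two, per the slide)."""
    assert is_power_of_two(region_size) and is_power_of_two(line_size)
    assert region_size >= 2 * line_size
    return region_size // line_size


def region_base(addr: int, region_size: int = REGION_SIZE) -> int:
    """Aligned base address of the region that contains addr."""
    assert is_power_of_two(region_size)
    return addr & ~(region_size - 1)


def line_index_in_region(addr: int, region_size: int = REGION_SIZE,
                         line_size: int = LINE_SIZE) -> int:
    """Position of addr's cache line inside its region, counted from 0."""
    return (addr & (region_size - 1)) // line_size
```

Because regions are aligned and power-of-two sized, finding the region of an address is a single mask, which is what lets the RCA be indexed like an ordinary cache.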

  8. Coarse-Grain Coherence Tracking
     • Each entry has an address tag, state, and count of lines cached by the processor
     • The state indicates if the processor and / or other processors are sharing / modifying lines in the region
     • On cache misses, the region state is read to determine if a broadcast is necessary
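A minimal sketch of an RCA entry and the miss-time check, assuming a deliberately simplified three-value region state. The state names, the dictionary-based array, and `needs_broadcast` are hypothetical and are not the authors' actual state encoding.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Dict


class RegionState(Enum):
    # Simplified, hypothetical states; the talk's real encoding is richer.
    INVALID = auto()            # nothing known about the region
    EXCLUSIVE = auto()          # no other processor caches lines of the region
    EXTERNALLY_SHARED = auto()  # other processors may cache lines of the region


@dataclass
class RegionEntry:
    tag: int               # address tag of the region
    state: RegionState     # coarse-grain coherence state
    line_count: int = 0    # lines of this region cached by this processor


def needs_broadcast(rca: Dict[int, RegionEntry], region_tag: int) -> bool:
    """Decide on a cache miss whether the request must be broadcast."""
    entry = rca.get(region_tag)
    if entry is None or entry.state is RegionState.INVALID:
        return True  # RCA miss: no information, so stay conservative
    return entry.state is not RegionState.EXCLUSIVE
```

Only a region known to be exclusive lets the request skip the broadcast; every other case falls back to conventional snooping.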

  9. Coarse-Grain Coherence Tracking
     • On snoops, the region state provides a response for the region
       • Piggy-backed onto the conventional response
       • Used to update other processors’ region state
     • RCA maintains inclusion over caches
       • When regions are evicted, their lines are evicted
       • RCA must respond correctly if region’s lines cached
       • Replacement algorithm uses line count
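One way to picture the inclusion and replacement bullets is the sketch below. It assumes a victim is chosen purely by line count, with ties broken by the lowest tag; that tie-break and the dictionary layout are assumptions made for illustration, not details from the talk.

```python
from typing import Dict, List, Set


def choose_victim(line_counts: Dict[int, int]) -> int:
    """Pick the region tag with the fewest cached lines (empty regions first)."""
    if not line_counts:
        raise ValueError("no candidate regions")
    # Tie-break on the tag only to keep the sketch deterministic (assumption).
    return min(line_counts, key=lambda tag: (line_counts[tag], tag))


def evict_region(region_tag: int,
                 cached_lines: Dict[int, Set[int]],
                 line_counts: Dict[int, int]) -> List[int]:
    """Drop a region and return the line addresses that must leave the cache too."""
    line_counts.pop(region_tag, None)
    return sorted(cached_lines.pop(region_tag, set()))
```

Evicting a region with a line count of zero returns an empty list, so no cache lines are lost to inclusion in that case.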

  10. Example: Conventional Snooping
      • P0 loads 1000₂
      • MISS
      • Snoop performed
      • Response sent
      • Data transfer
      [Figure: processors P0 and P1 with caches $0 and $1 (Tag and State columns) and memories M0 and M1 on a network. The request "Read: P0, 1000₂" is broadcast and the snoop response is "Invalid". In $0, the line with tag 0010 is shown in states Invalid, Pending, and Exclusive.]

  11. Coarse-Grain Coherence Tracking
      Region Coherence Array added; two lines per region
      • P0 loads 1000₂
      • MISS
      • Snoop performed
      • Response sent
      • Data transfer
      P0 has exclusive access to region
      [Figure: the same system with an RCA beside each cache. The request "Read: P0, 1000₂" is broadcast and the response is "Invalid, Region Not Shared". In P0, cache tag 0010 is shown in states Invalid, Pending, and Exclusive, and RCA tag 001 in states Invalid, Pending, and DI.]

  12. Coarse-Grain Coherence Tracking
      Region Coherence Array added; two lines per region
      • P0 loads 1100₂
      • MISS, Region Hit
      • Direct request sent
      • Data transfer
      Exclusive region state, broadcast unnecessary
      [Figure: P0 holds cache tag 0010 (Exclusive) and RCA tag 001 (DI). The request "Read: P0, 1100₂" goes directly to memory, and cache tag 0011 in $0 is shown in states Invalid, Pending, and Exclusive.]

  13. Coarse-Grain Coherence Tracking
      Region Coherence Array added; two lines per region
      • P1 stores 1000₂
      • MISS
      • Snoop performed
      • Hits in P0 cache
      • Response sent
      • Data transfer
      Region not exclusive anymore
      [Figure: the request "RFO: P1, 1000₂" is broadcast and the response is "Owned, Region Owned". In P0, cache tag 0010 is shown in states Exclusive, Pending, and Invalid, and RCA tag 001 in states DI and DD. In P1, cache tag 0010 is shown in states Invalid, Pending, and Modified, and RCA tag 001 in states Invalid, Pending, and DD.]
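The three RCA example slides can be replayed with a toy model. It assumes the slide addresses are the binary values 1000 and 1100 and that dropping the low three address bits gives the region tag (consistent with the region tag 001 shown for both). Caches, data movement, and the detailed region states are not modelled; `Node`, `access`, and `run_example` are hypothetical names.

```python
REGION_SHIFT = 3  # assumed: address bits above bit 2 form the region tag


class Node:
    """One processor's RCA, reduced to: region tag -> is the region exclusive?"""

    def __init__(self) -> None:
        self.regions = {}


def access(nodes, who: int, addr: int) -> str:
    """Return 'direct' if the request can skip the broadcast, else 'broadcast'."""
    region = addr >> REGION_SHIFT
    me = nodes[who]
    if me.regions.get(region) is True:
        return "direct"  # exclusive region state: broadcast unnecessary
    shared = False
    for i, other in enumerate(nodes):
        if i != who and region in other.regions:
            other.regions[region] = False  # the snoop ends the other node's exclusivity
            shared = True
    me.regions[region] = not shared  # region response updates the requester
    return "broadcast"


def run_example():
    nodes = [Node(), Node()]
    return [
        access(nodes, 0, 0b1000),  # P0 load: RCA miss, snoop, region becomes exclusive
        access(nodes, 0, 0b1100),  # P0 load: same region, sent directly to memory
        access(nodes, 1, 0b1000),  # P1 store: snoop, region no longer exclusive
    ]
```

Calling `run_example()` yields `['broadcast', 'direct', 'broadcast']`, matching the broadcast, direct request, and broadcast sequence of the three slides.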

  14. Overhead
      • Storage space needed for RCA
        • 3-6% storage overhead for cache
      • Two bits needed in snoop response for region response
      • Path to memory needed to avoid broadcasts
        • Simple with on-chip memory controllers
        • May leverage data network

  15. Simulator
      PHARMsim:
      • Execution-driven simulator built on top of SimOS-PPC
      • Four 4-way superscalar out-of-order processors
      • Two-level hierarchy with split L1, unified L2 caches
      • Separate address / data networks, similar to Fireplane
      • Region Coherence Array with same sets / associativity as L2

  16. Workloads
      • Scientific
        • Ocean, Raytrace, Barnes
      • Multiprogrammed
        • SPECint2000_rate
      • Commercial
        • TPC-W, TPC-B, TPC-H, SPECweb99, SPECjbb2000

  17. Broadcasts Avoided

  18. Snoop Traffic Reduction – Peak
      [Chart; value labels shown: 64%, 51%, 38%]

  19. Snoop Traffic Reduction – Average
      [Chart; value labels shown: 47%, 74%, 86%]

  20. Execution Time
      [Chart; value label shown: 91.2%]

  21. Remaining Opportunity
      • With 512B regions, ~10% of requests are broadcast unnecessarily
        • A third of the 10% are region false sharing
        • Half of the 10% miss in RCA
        • Potential for prefetching
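The fractions on this slide are shares of the ~10%, not of all requests. A quick calculation, taking the approximate 10% figure as exact and treating whatever is left over as unattributed (the slide does not break it down), converts them to shares of all requests. The function name is hypothetical.

```python
from fractions import Fraction


def remaining_opportunity(unnecessary=Fraction(10, 100)):
    """Split the unnecessary broadcasts into shares of all requests."""
    false_sharing = unnecessary / 3   # "a third of the 10%"
    rca_miss = unnecessary / 2        # "half of the 10%"
    unattributed = unnecessary - false_sharing - rca_miss  # not explained on the slide
    return {
        "region false sharing": false_sharing,
        "RCA miss": rca_miss,
        "unattributed": unattributed,
    }


for cause, share in remaining_opportunity().items():
    print(f"{cause}: {float(share) * 100:.1f}% of all requests")
```

Under these assumptions, region false sharing accounts for roughly 3.3% of all requests and RCA misses for 5%.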

  22. Inclusion Overhead
      Regions with no lines cached are replaced first

  23. Conclusion
      Coarse-Grain Coherence Tracking:
      • Reduces broadcast traffic
        • Most data requests sent directly to memory
      • Reduces latency
        • Many requests not sent to central arbitration point
        • Many non-data requests not sent externally
      • Improves scalability and performance

  24. The End

  25. Inclusion Evictions

  26. Ordering
      • Ordering point is now the Region Coherence Array
        • A direct request is ordered once it accesses the RCA
      • Direct requests are serialized w.r.t. snoop requests
        • A direct request occurs either before or after a snoop
        • All must appear to access and update RCA atomically
      • No two processors can have exclusive access to a region at the same time (no races)
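The atomicity requirement can be illustrated with a lock around every RCA access. This is a software analogy only, not a description of the hardware mechanism; the class and method names are hypothetical.

```python
import threading


class RegionCoherenceArray:
    """Toy RCA in which every access and update happens under one lock."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._exclusive = set()   # region tags this processor holds exclusively
        self.log = []             # order in which requests reached the RCA

    def grant_exclusive(self, region: int) -> None:
        with self._lock:
            self._exclusive.add(region)

    def local_request(self, region: int) -> str:
        """A local miss: ordered at the moment it reads the RCA."""
        with self._lock:
            kind = "direct" if region in self._exclusive else "broadcast"
            self.log.append((kind, region))
            return kind

    def snoop(self, region: int) -> None:
        """An external request: ends exclusivity for the region."""
        with self._lock:
            self._exclusive.discard(region)
            self.log.append(("snoop", region))
```

Because both paths take the same lock, a local request sees the region state either entirely before or entirely after a snoop, never a partially updated one.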

  27. Comparison to RegionScout

  28. Execution Time
