
Decoupling Local Variable Accesses in a Wide-Issue Superscalar Processor

‘99 ACM/IEEE International Symposium on Computer Architecture. Sangyeun Cho, U of Minnesota/Samsung; Pen-Chung Yew, U of Minnesota; Gyungho Lee, U of Texas at San Antonio.



Presentation Transcript


  1. ‘99 ACM/IEEE International Symposium on Computer Architecture Decoupling Local Variable Accesses in a Wide-Issue Superscalar Processor Sangyeun Cho, U of Minnesota/Samsung Pen-Chung Yew, U of Minnesota Gyungho Lee, U of Texas at San Antonio

  2. Roadmap • Need for Higher Bandwidth Caches • Multi-Ported Data Caches • Data Decoupling • Motivation • Approach • Implementation Issues • Quantitative Evaluation • Conclusions Cho, Yew, and Lee

  3. Wide-Issue Superscalar Processors • Current Generation • Alpha 21264 • Intel’s Merced • Future Generation (IEEE Computer, Sept. ‘97) • Superspeculative Processors • Trace Processors

  4. Multi-Ported Data Caches • Cache Built with Multi-Ported Cells • Replicated Cache • Alpha 21164 • Interleaved Cache • MIPS R10K • Time-Division Multiplexing • Alpha 21264

  5. Replicated Cache • Pros. • Simple design • Symmetric read ports • Cons. • Doubled area • Exclusive writes for data coherence

  6. Time-Division Multiplexed Cache • Pros. • True 2-port cache • Cons. • Hardware design complexity • Not scalable beyond 2 ports

  7. Interleaved Cache • Pros. • Scalable • Cons. • Asymmetric ports • Bank conflicts • Constraints in number of banks
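The bank-conflict drawback above can be sketched in a few lines, assuming a simple power-of-two interleaving on cache-line addresses (the bank count and line size here are illustrative choices, not figures from the paper):

```python
# Illustrative sketch (not the paper's design): a bank-interleaved cache
# selects a bank from low-order line-address bits, so two accesses issued
# in the same cycle that map to the same bank must serialize.

NUM_BANKS = 4        # assumed bank count; must be a power of two
LINE_SIZE = 32       # assumed cache-line size in bytes

def bank_of(addr: int) -> int:
    """Select a bank from the low-order line-address bits."""
    return (addr // LINE_SIZE) % NUM_BANKS

def count_conflicts(addrs):
    """Count accesses that must stall because an earlier access in the
    same cycle already claimed their bank."""
    busy = set()
    conflicts = 0
    for a in addrs:
        b = bank_of(a)
        if b in busy:
            conflicts += 1   # bank already in use this cycle
        else:
            busy.add(b)
    return conflicts

# Two accesses 128 bytes apart land in the same bank (128/32 % 4 == 0):
print(count_conflicts([0x1000, 0x1080]))  # -> 1
```

This is why strided access patterns whose stride is a multiple of NUM_BANKS * LINE_SIZE perform poorly on interleaved designs.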

  8. Window Logic Complexity • Pointed out as the major source of hardware complexity (Palacharla et al., ISCA ‘97) • More severe for the memory window • Difficult to partition • Thick network needed to connect RSs and LSUs

  9. Data Decoupling • A divide-and-conquer approach • Instructions partitioned before entering RS • Narrower networks • Fewer ports to each cache

  10. Data Decoupling: Operating Issues • Memory Stream Partitioning • Hardware classification • Compiler classification • Load Balancing • Enough instructions in different groups? • Are they well interleaved?

  11. Case for Decoupling Stack Accesses • Easily Identifiable • Hardware mechanism: a simple 1-bit predictor with enough context information works well (&gt;99.9%). • Compiler mechanism: helps reduce the prediction table space required for good performance, but is not essential. • Many of Them: 30% of loads, 48% of stores • Well-Interleaved: continuous supply of stack references with a reasonable window size • Details are found in: Cho, Yew, and Lee. “Access Region Locality for High-Bandwidth Processor Memory System Design”, CSTR #99-004, Univ. of Minnesota.

  12. Data Decoupling: Mechanism • Dynamically Predicting Access Regions for Partitioning Memory Instructions • Utilize access region locality • Refer to context information, e.g., global branch history, call site identifier • Dynamically Verifying Region Prediction • Let the TLB (i.e., page table) contain verification information so that a memory access is reissued on a misprediction.
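A minimal sketch of a context-based 1-bit region predictor of the kind this slide describes. The table size and the PC-XOR-history indexing function are assumptions for illustration, not the paper's exact design:

```python
# Hypothetical sketch: each 1-bit entry remembers whether the last memory
# access seen under this context touched the stack region. The context is
# formed here from the instruction address and global branch history.

TABLE_SIZE = 1024                 # assumed number of 1-bit entries

table = [False] * TABLE_SIZE      # False = non-stack, True = stack

def index(pc: int, history: int) -> int:
    """Form a table index from context information (assumed hash)."""
    return (pc ^ history) % TABLE_SIZE

def predict(pc: int, history: int) -> bool:
    """Predict whether this access targets the stack region."""
    return table[index(pc, history)]

def update(pc: int, history: int, was_stack: bool) -> None:
    """After the region is verified (e.g., via the TLB), train the entry
    with the true outcome."""
    table[index(pc, history)] = was_stack
```

On a misprediction, the access would be reissued to the other memory pipeline, which matches the TLB-based verification step on the slide.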

  13. Data Decoupling: Mechanism, Cont’d • Access Region Locality

  14. Data Decoupling: Mechanism, Cont’d • Dynamic Partitioning Accuracy (chart over the SPEC95 benchmarks, with integer and FP averages; prediction-table sizes: 1 KB, 2 KB, 4 KB, 8 KB, and unlimited)

  15. Data Decoupling: Optimizations • Fast Forwarding • Uses the offset (used with $sp) to resolve dependences • Can shorten latency • Example: st r3, 8($sp) ... ld r4, 8($sp) (address matched, so the store’s value is forwarded) • Access Combining • Combines accesses to adjacent locations • Can save bandwidth • Example: st r3, 4($sp); st r4, 8($sp) → st {r3,r4}, {4,8($sp)}
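Both optimizations above can be sketched in software, assuming a simple store queue of ($sp-relative offset, value) pairs; the queue representation and word size are illustrative, not the paper's hardware:

```python
# Illustrative sketch of the two slide optimizations:
#  - fast forwarding: a stack load whose $sp offset matches a pending
#    store takes the store's value instead of going to the cache;
#  - access combining: two stores to adjacent stack words merge into
#    one wide access, saving a cache port.

WORD = 4  # assumed word size in bytes

def fast_forward(store_queue, load_offset):
    """Return the youngest pending store value whose $sp offset matches
    the load's offset, or None if the load must access the cache."""
    for offset, value in reversed(store_queue):
        if offset == load_offset:
            return value
    return None

def combine(store_queue):
    """Merge pairs of stores to adjacent words into one 2-word access."""
    combined, i = [], 0
    q = sorted(store_queue)
    while i < len(q):
        if i + 1 < len(q) and q[i + 1][0] == q[i][0] + WORD:
            combined.append((q[i][0], (q[i][1], q[i + 1][1])))
            i += 2
        else:
            combined.append(q[i])
            i += 1
    return combined

# st r3, 8($sp) ... ld r4, 8($sp): the load receives r3's value.
# st r3, 4($sp); st r4, 8($sp): one combined 2-word store at 4($sp).
```

The 2-way limit on combining here mirrors the later result that 2-way combining is enough.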

  16. Benchmark Programs

  17. Program’s Memory Accesses (chart: access breakdown across the SPEC95 benchmarks, with integer and FP averages)

  18. Program’s Frame Size Distribution • Stack references tend to access a small region. • Average size of dynamic frames was around 3 words. • Average size of static frames was around 7 words.

  19. Base Machine Model

  20. Program’s Bandwidth Requirements (charts: integer and FP) • Performance suffers greatly with fewer than 3 cache ports. • We study 3 cases: • Cache has 2 ports • Cache has 3 ports • Cache has 4 ports

  21. Impact of LVC Size (chart: 0.5 KB, 1 KB, 2 KB, and 4 KB LVCs) • 2KB and 4KB LVCs achieve high hit rates (~99.9%). • Set associativity is less important if the LVC is 2KB or more. • A small, simple LVC works well.

  22. Fast Data Forwarding

  23. Access Combining (chart: (3+1) and (3+2) port configurations) • Effective (over 8% improvement) when LVC bandwidth is scarce. • 2-way combining is enough.

  24. Performance of Various Configurations (chart: (N+0) through (N+5) port configurations)

  25. Performance of 126.gcc (chart: (N+0) through (N+5))

  26. Performance of 130.li (chart: (N+0) through (N+5))

  27. Performance of 102.swim (chart: (N+0) through (N+5))

  28. Other Findings • LVC hit latency has less impact than data cache latency due to • Many loads hitting in the LVAQ • Out-of-order issuing • Addition of the LVC reduced conflict misses in • 130.li (by 24%) and 147.vortex (by 7%) • May reduce bandwidth requirements on the bus to the L2 cache

  29. Overall Performance (chart: speedups across the SPEC95 benchmarks, with integer and FP averages)

  30. Conclusions • Superscalar processors will be around… • But their design complexity will call for architectural solutions. • Memory bandwidth becomes critical. • Data Decoupling is a way to • Decrease hardware complexity of memory issue logic and caches. • Provide additional bandwidth for decoupled stack accesses.
