Korea University, VLSI Signal Processing Lab. Jinil Chung ( 정진일 ) ( jinil_chung@korea.ac.kr )

[Paper Review]. Minimalist Open-page: A DRAM Page-mode Scheduling Policy for the Many-core Era Dimitris Kaseridis + , Jeffrey Stuecheli *+ , and Lizy Kurian John + MICRO’11. + The University of Texas at Austin * IBM Corp. Korea University, VLSI Signal Processing Lab.


Presentation Transcript


  1. [Paper Review] Minimalist Open-page: A DRAM Page-mode Scheduling Policy for the Many-core Era. Dimitris Kaseridis+, Jeffrey Stuecheli*+, and Lizy Kurian John+, MICRO’11 (+ The University of Texas at Austin, * IBM Corp.). Korea University, VLSI Signal Processing Lab. Jinil Chung (정진일) (jinil_chung@korea.ac.kr)

  2. Abstract
     -. DRAM design is a balance between performance, power, and storage density.
     -. To realize good performance, one must manage the structural and timing restrictions of the DRAM devices.
     -. Use of the “page-mode” feature can mitigate many DRAM constraints.
     -. Aggressive page-mode use results in many conflicts (e.g. bank conflicts) when multiple workloads in a many-core system map to the same DRAM. [IEEE Spectrum (link)]
     -. This paper takes a minimalist approach: “just enough” page-mode accesses to get the benefits while avoiding unfairness -> proposed address hashing + data prefetch engine + per-request priority.

  3. 1. Introduction: row buffer (or “page-mode”) access
     • This paper proposes a combination of open- and closed-page policies based on …
     • Page-mode gain comes with only a small number of page accesses -> propose a fair DRAM address mapping scheme: low RBL & high BLP
     • Page-mode hits come from spatial locality (NOT temporal locality!), which can be captured in prefetch engines -> propose an intuitive criticality-based memory request priority scheme
     RBL: Row-Buffer Locality; BLP: Bank-Level Parallelism

  4. 2. Background
     -. DRAM timing constraints result in “dead time” before and after each random access.
     -. The MC’s (Memory Controller’s) job is to reduce these performance-limiting gaps using parallelism.
     1) tRC (row cycle time; ACT-to-ACT @same BK): after the MC activates a page, it must wait tRC before the next activate @same BK. When multiple threads access different rows @same BK, this adds latency overhead (tRC delay).
     2) tRP (row precharge time; PRE-to-ACT @same BK): in an open-page policy, when the MC activates another page it pays the tRP penalty @same BK (= the current page must be closed before the new page is opened).
     Timing diagram (@same bank): ACT -> PRE -> ACT, with tRAS (e.g. 36ns) from ACT to PRE, tRP (e.g. 12ns) from PRE to the next ACT, and tRC (e.g. 48ns) from ACT to ACT.
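The example timings on this slide are consistent with the standard relation tRC = tRAS + tRP. The sketch below checks that relation and shows how row conflicts on a single bank serialize at tRC. The function names are hypothetical, and the model deliberately ignores queuing, refresh, and every other timing constraint.

```python
# Example timings from the slide, in nanoseconds.
T_RAS_NS = 36  # ACT-to-PRE: minimum time a row stays open
T_RP_NS = 12   # PRE-to-ACT: precharge time before the next activate


def row_cycle_ns(t_ras_ns: int, t_rp_ns: int) -> int:
    """tRC (ACT-to-ACT on one bank) is the sum of tRAS and tRP."""
    return t_ras_ns + t_rp_ns


def earliest_activate_ns(n: int, t_rc_ns: int) -> int:
    """Earliest issue time of the n-th ACT (0-based) to one bank when every
    request targets a different row, so each one needs its own activate.
    Hypothetical helper: ignores queuing, refresh, and other timing limits."""
    return n * t_rc_ns


t_rc = row_cycle_ns(T_RAS_NS, T_RP_NS)
print(t_rc)                           # 48
print(earliest_activate_ns(3, t_rc))  # 144
```

Under these assumptions, the fourth request to a conflicting row in the same bank cannot even begin its activate until 144ns after the first one, which is the "dead time" the memory controller tries to hide with bank-level parallelism.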

  5. 3. Motivation: use of “page-mode” … (next page)
     -. Latency effects: due to tRC & tRP, overall latency increases -> small # of accesses?
     -. Power reduction: only activate power is reduced -> small # of accesses is enough
     -. Bank utilization: drops off quickly as accesses per activation increase -> small # of accesses is enough. If bank utilization is high, the probability that a new request will conflict w/ a busy bank is greater.
     -. Other DRAM complexities: a small # of accesses is needed to soften restrictions. ex) tFAW (Four Activate Window; 30ns), cache block transfer delay = 6ns
        +. single access per ACT: peak utilization is limited (6ns*4/30ns = 80%)
        +. two or more accesses per ACT: peak utilization is not limited (12ns*4/30ns > 100%)
     (Figure annotations: 62%, 16%; closed-page policy)
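The tFAW arithmetic on this slide can be reproduced with a few lines. This is a minimal sketch assuming a 6ns cache block transfer (the value the slide's own arithmetic uses), a 30ns tFAW window, and at most four activates per window; the function name is hypothetical.

```python
def peak_bus_utilization(accesses_per_act: int,
                         transfer_ns: float = 6.0,
                         t_faw_ns: float = 30.0,
                         acts_per_window: int = 4) -> float:
    """Upper bound on data-bus utilization imposed by tFAW alone.

    At most `acts_per_window` activates fit in one tFAW window, and each
    activate feeds `accesses_per_act` block transfers of `transfer_ns` each.
    The result is capped at 1.0 because the bus cannot be more than fully busy.
    """
    busy_ns = accesses_per_act * transfer_ns * acts_per_window
    return min(1.0, busy_ns / t_faw_ns)


print(peak_bus_utilization(1))  # 0.8 -> tFAW caps the bus at 80%
print(peak_bus_utilization(2))  # 1.0 -> tFAW no longer limits the bus
```

Two accesses per activation are already enough to remove the tFAW cap, which supports the slide's point that only a small number of page-mode hits is needed.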

  6. 3. Motivation 3.1 Row-buffer Locality in Modern Processors
     -. Current workstation/server class designs have a large last-level cache (e.g. IBM POWER7).
     -. Temporal locality results in hits to the large last-level cache, so row buffers exploit only spatial locality.
     -. Spatial locality can be predicted using prefetch engines.
     RBL: Row-Buffer Locality

  7. 3. Motivation 3.2 Bank and Row Buffer Locality Interplay with Address Mapping
     -. DRAM device address: row, column, and bank
     -. Workload A: long sequential access sequence; Workload B: single operation
     a) All DRAM column bits mapped to the low-order real address bits, workload A has higher priority -> B0 is slowed. e.g. FR-FCFS
     b) All DRAM column bits mapped to the low-order real address bits, workload B has higher priority -> A4 is slowed. e.g. ATLAS, PAR-BS
     c) DRAM column & bank bits mapped to the low-order real address bits -> high BLP (Bank-Level Parallelism): B0 can be serviced w/o degrading traffic to workload A. e.g. Minimalist

  8. 4. Minimalist Open-page Mode 4.1 DRAM Address Mapping Scheme
     -. The basic difference is that the column access bits (7-bit) are split in two places.
        +. The 2 LSBs are located right after the block bits (for sequential access of 4 cache lines).
        +. The 5 MSBs are located just before the row bits.
     -. (Not shown in the figure) Higher-order address bits are XOR-ed with the bank bits to produce the actual bank selection bits -> reduces row buffer conflicts [Zhang et al./MICRO’00]
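The split-column mapping can be sketched as a bit-field decode. The 2-bit and 5-bit column fields come from the slide; everything else is an assumption made for illustration: 64-byte cache blocks (6 offset bits), 3 bank bits with no rank or channel bits, and the low row bits as the "higher-order address bits" used in the XOR hash. The names are hypothetical and this is not the paper's exact layout.

```python
BLOCK_BITS = 6    # assumption: 64-byte cache blocks
COL_LO_BITS = 2   # from the slide: 2 column LSBs right after the block bits
BANK_BITS = 3     # assumption: 8 banks, no rank/channel bits modelled
COL_HI_BITS = 5   # from the slide: 5 column MSBs just below the row bits


def _take(value: int, nbits: int) -> tuple[int, int]:
    """Split off the low `nbits` of value; return (field, remaining bits)."""
    return value & ((1 << nbits) - 1), value >> nbits


def map_address(addr: int) -> dict[str, int]:
    """Decode a physical address, from LSB to MSB, as
    [block offset | col low | bank | col high | row]."""
    _, rest = _take(addr, BLOCK_BITS)
    col_lo, rest = _take(rest, COL_LO_BITS)
    bank, rest = _take(rest, BANK_BITS)
    col_hi, row = _take(rest, COL_HI_BITS)
    # XOR hash: fold low row bits into the bank index (assumed choice of bits).
    bank ^= row & ((1 << BANK_BITS) - 1)
    return {"row": row, "bank": bank, "column": (col_hi << COL_LO_BITS) | col_lo}


for line in range(5):
    print(line, map_address(line * 64))
```

With this layout, four consecutive cache lines land in the same row of the same bank (up to four page-mode accesses per activation), and the fifth line moves to the next bank, which spreads a sequential stream across banks instead of monopolizing one.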

  9. 4. Minimalist Open-page Mode 4.2 Data Prefetch Engine [IBM POWER6]
     : predictable “page-mode” opportunities -> need for an accurate prefetch engine
     : each core includes a HW prefetcher w/ a prefetch depth distance predictor
     1) Multi-line Prefetch Requests
     -. Multi-line prefetch operation: a single request indicates a specific sequence of cache lines
     -. Reduces command BW and queue resource usage

  10. 4. Minimalist Open-page Mode 4.3 Memory Request Queue Scheduling Scheme
     : in OOO execution, the importance of each request can vary both between and within applications -> need for a dynamic priority scheme
     1) DRAM Memory Requests Priority Calculation
     -. Each request gets a different priority based on its criticality to performance
     -. The priority of each request increases every 100ns time interval -> time-based
     -. 2 categories: read (normal) and prefetch -> read requests have higher priority
     -. Read requests: MLP information from the MSHRs in each core: many misses -> each is less important
     -. Distance information from the prefetch engine (4.2)
     MLP: Memory-Level Parallelism; MSHR: Miss Status Holding Register
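The priority rules listed on this slide can be illustrated with a small scoring function. This is only a sketch: the numeric weights, the cap of 7, the `Request` fields, and the assumption that a smaller prefetch distance means a more urgent prefetch are all hypothetical choices made for illustration, not the paper's priority table. Only the 100ns aging interval, the read-over-prefetch ordering, and the "many misses means less important" rule come from the slide.

```python
from dataclasses import dataclass

AGING_INTERVAL_NS = 100  # from the slide: priority rises every 100ns


@dataclass
class Request:
    is_prefetch: bool
    arrival_ns: int
    mlp: int = 1                # outstanding misses in the core's MSHRs (reads)
    prefetch_distance: int = 0  # distance reported by the prefetch engine


def priority(req: Request, now_ns: int) -> int:
    """Higher value = scheduled first. All weights are hypothetical."""
    base = 0 if req.is_prefetch else 8  # reads start above prefetches
    if req.is_prefetch:
        penalty = min(req.prefetch_distance, 7)  # assumed: far ahead = less urgent
    else:
        penalty = min(req.mlp - 1, 7)            # many misses = less critical each
    age = (now_ns - req.arrival_ns) // AGING_INTERVAL_NS
    return base - penalty + age


lonely_read = Request(is_prefetch=False, arrival_ns=0, mlp=1)
busy_read = Request(is_prefetch=False, arrival_ns=0, mlp=4)
prefetch = Request(is_prefetch=True, arrival_ns=0, prefetch_distance=2)
print(priority(lonely_read, 0), priority(busy_read, 0), priority(prefetch, 0))
# 8 5 -2
```

Because the age term grows without bound, an old prefetch eventually outranks a newly arrived read, which is one simple way to get the time-based starvation protection the slide describes.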

  11. 4. Minimalist Open-page Mode 4.3 Memory Request Queue Scheduling Scheme (cont.)
     2) DRAM Page Closure (Precharge) Policy
     -. Uses autoprecharge (no separate PRE command is issued), increasing available command BW
     3) Overall Memory Requests Scheduling Scheme (Priority Rules 1)
     -. The same rules are used by all MCs -> no need for communication among MCs
     -. If an MC is servicing the multiple transfers of a multi-line prefetch request, it can be interrupted by a higher-priority request -> a very critical request can be serviced w/ the smallest latency
     4) Handling write operations
     -. The dynamic priority scheme does not apply to writes
     -. Uses the VWQ (Virtual Write Queue) -> causing minimal write instructions

  12. 5. Evaluation
     -. 8-core CMP system using the Simics functional model extended w/ the GEMS toolset
     -. Simulates DDR3 1333MHz DRAM, using the memory controller policy under test for each experiment
     -. The Minimalist open-page scheme is compared against three open-page policies (Table 5):
     1) PAR-BS (Parallelism-Aware Batch Scheduler)
     2) ATLAS (Adaptive per-Thread Least-Attained-Service) memory scheduler
     3) FR-FCFS (First-Ready, First-Come-First-Served): baseline

  13. 5. Evaluation 5.1 Throughput
     -. Overall, “Minimalist Hash+Priority” demonstrated the best throughput of all the schemes, achieving a 10% improvement.
     -. By comparison, ATLAS and PAR-BS achieved 3.2% and 2.8% throughput improvements over the whole workload suite.

  14. 5. Evaluation 5.2 Fairness
     -. Minimalist improves fairness by up to 15%, with overall improvements of 7.5%, 3.4% and 2.5% over FR-FCFS, PAR-BS and ATLAS, respectively.

  15. 5. Evaluation 5.3 Row Buffer Access per Activation
     -. The observed page-access rate for the aggressive open-page policies falls significantly short -> a high page hit rate is simply not possible given the interleaving of requests between the eight executing programs.
     -. With the Minimalist scheme, the achieved page-access rate is close to 3.5, compared to the ideal rate of four.

  16. 5. Evaluation 5.4 Target Page-hit Count Sensitivity
     -. The Minimalist system requires a target number of page hits to be selected, which sets the maximum number of page hits the scheme attempts to achieve per row activation.
     -. A target of 4 page hits provides the best results. (Note that a different system configuration may shift the optimal page-mode hit count.)

  17. 5. Evaluation 5.5 DRAM Energy Consumption
     -. Power consumption was estimated with the Micron power calculator.
     -. Energy is approximately the same as FR-FCFS: PAR-BS, ATLAS and “Minimalist Hash+Priority” provide a small decrease of approximately 5% in overall energy consumption.
     -. The energy results are essentially a balance between the decrease in page-mode hits (resulting in higher DRAM activation power) and the increase in system performance (decreasing runtime).

  18. Conclusions
     Minimalist Open-page memory scheduling policy
     -. Page-mode gain w/ a small number of page accesses for each page activation
     -. Assigns per-request priority using request stream information from MLP and the data prefetch engine
     Improving throughput and fairness
     -. Throughput increased by 10% on average (compared to FR-FCFS)
     -. No need for thread-based priority information
     -. No need for communication/coordination among multiple MCs or the OS

  19. Appendix. Detailed simulation information

  20. Appendix. Detailed simulation information

  21. Appendix. Detailed simulation information

  22. Thanks,
