Lecture 13: DRAM Innovations


Presentation Transcript


  1. Lecture 13: DRAM Innovations
  • Today: energy efficiency, row buffer management, scheduling

  2. Latency and Power Wall
  • Power wall: 25-40% of datacenter power can be attributed to the DRAM system
  • Latency and power can both be improved by employing smaller arrays; incurs a penalty in density and cost
  • Latency and power can both be improved by increasing the row buffer hit rate; requires intelligent mapping of data to rows, clever scheduling of requests, etc.
  • Power can be reduced by minimizing overfetch – either read fewer chips or read parts of a row; incurs penalties in area or bandwidth

  3. Overfetch
  • Overfetch caused by multiple factors (a rough calculation is sketched below):
  • Each array is large (fewer peripherals → more density)
  • Involving more chips per access → more data transfer pin bandwidth
  • More overfetch → more prefetch; helps apps with locality
  • Involving more chips per access → less data loss when a chip fails → lower overhead for reliability
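
A back-of-envelope illustration of why overfetch hurts energy: a whole row is activated across every chip in the rank, while a single access only needs one cache line. The chip count, per-chip row size, and line size below are illustrative DDR3-style assumptions, not numbers from the slides.

```python
# Rough overfetch ratio for one row activation (all constants are assumptions).
chips_per_rank    = 8      # x8 chips forming a 64-bit data bus
row_size_per_chip = 1024   # bytes brought into each chip's row buffer
cache_line        = 64     # bytes actually requested by the CPU

row_activated   = chips_per_rank * row_size_per_chip   # 8 KB per activation
overfetch_ratio = row_activated / cache_line

print(f"Activated per row open: {row_activated} B")
print(f"Useful per access:      {cache_line} B")
print(f"Overfetch ratio:        {overfetch_ratio:.0f}x "
      "(only amortized if later accesses hit the open row)")
```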

  4. Re-Designing Arrays Udipi et al., ISCA’10

  5. Selective Bitline Activation
  • Additional logic per array so that only relevant bitlines are read out
  • Essentially results in finer-grain partitioning of the DRAM arrays
  • Two papers in 2010: Udipi et al., ISCA’10, and Cooper-Balis and Jacob, IEEE Micro

  6. Rank Subsetting
  • Instead of using all chips in a rank to read out 64-bit words every cycle, form smaller parallel ranks
  • Increases data transfer time; reduces the size of the row buffer
  • But, lower energy per row read and compatible with modern DRAM chips
  • Increases the number of banks and hence promotes parallelism (reduces queuing delays)
  • Mini-Rank, MICRO’08; MC-DIMM, SC’09
  • (A rough model of the trade-off is sketched below.)
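
A minimal sketch of the rank-subsetting trade-off: fewer chips per subset means a narrower bus (more beats per cache line) but less activation energy and more independent subsets. The line size, chip count, and energy unit are assumptions chosen only to show the trend.

```python
# Rough rank-subsetting model (illustrative constants, not from any paper).
LINE_BYTES = 64              # cache-line fill
CHIPS_IN_FULL_RANK = 8       # x8 devices on a 64-bit bus
BUS_BITS_PER_CHIP = 8
E_ACT_PER_CHIP = 1.0         # arbitrary energy units per chip activated

def subset_stats(chips_per_subset):
    bus_width_bits = chips_per_subset * BUS_BITS_PER_CHIP
    beats = (LINE_BYTES * 8) // bus_width_bits     # beats to move one line
    act_energy = chips_per_subset * E_ACT_PER_CHIP # energy per row activation
    subsets = CHIPS_IN_FULL_RANK // chips_per_subset  # independent "mini-ranks"
    return beats, act_energy, subsets

for chips in (8, 4, 2, 1):
    beats, energy, subsets = subset_stats(chips)
    print(f"{chips} chips/subset: {beats:3d} beats/line, "
          f"activate energy {energy:.1f}, {subsets} parallel subsets")
```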

  7. Row Buffer Management
  • Open Page policy: maximizes row buffer hits, minimizes energy
  • Close Page policy: helps performance when there is limited locality
  • Hybrid policies: can close a row buffer after it has served its utility; lots of ways to predict utility: time, accesses, locality counters for a bank, etc. (see the sketch below)
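
A minimal sketch of one possible hybrid policy: keep a row open (open-page behaviour) but close it once an access-count threshold is reached or it has sat idle too long. The thresholds and the single-bank state machine are assumptions; real controllers track this per bank with hardware counters.

```python
# Hybrid row-buffer policy sketch (thresholds are illustrative assumptions).
class HybridRowPolicy:
    def __init__(self, max_hits=4, idle_limit=100):
        self.max_hits = max_hits      # close after this many consecutive hits
        self.idle_limit = idle_limit  # close after this many idle cycles
        self.open_row = None
        self.hits = 0
        self.idle = 0

    def access(self, row):
        """Classify an access as 'hit', 'miss', or 'conflict' and update state."""
        if self.open_row is None:
            result = 'miss'          # row must be activated
        elif self.open_row == row:
            result = 'hit'
        else:
            result = 'conflict'      # precharge + activate needed
        self.open_row, self.idle = row, 0
        self.hits = self.hits + 1 if result == 'hit' else 0
        if self.hits >= self.max_hits:   # row has served its utility: close it
            self.open_row, self.hits = None, 0
        return result

    def tick(self):
        """Called on idle cycles; closes a row that has been open too long."""
        if self.open_row is not None:
            self.idle += 1
            if self.idle >= self.idle_limit:
                self.open_row, self.hits = None, 0

policy = HybridRowPolicy()
print([policy.access(r) for r in (7, 7, 7, 3, 3)])
# -> ['miss', 'hit', 'hit', 'conflict', 'hit']
```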

  8. Micro-Pages (Sudan et al., ASPLOS’10)
  • Organize data across banks to maximize locality in a row buffer
  • Key observation: most locality is restricted to a small portion of an OS page
  • Such hot micro-pages are identified with hardware counters and co-located on the same row (sketched below)
  • Requires hardware indirection to a page’s new location
  • Works well only if most activity is confined to a few micro-pages
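
A small sketch of the hot micro-page idea, assuming 4 KB OS pages split into 1 KB micro-pages and a simple per-micro-page access counter; the remap dictionary stands in for the hardware indirection the slide mentions, and the threshold and trace are made up.

```python
# Hot micro-page detection sketch (page sizes and threshold are assumptions).
from collections import Counter

OS_PAGE = 4096
MICRO_PAGE = 1024
HOT_THRESHOLD = 3

counters = Counter()
remap = {}            # (os_page, micro_index) -> co-located slot in a hot row

def touch(addr):
    page, offset = addr // OS_PAGE, addr % OS_PAGE
    key = (page, offset // MICRO_PAGE)
    counters[key] += 1
    if counters[key] == HOT_THRESHOLD and key not in remap:
        remap[key] = len(remap)   # co-locate hot micro-pages on the same row

# Simulate a trace where most activity hits one micro-page per OS page.
for addr in [0, 64, 128, 4096 + 1024, 4096 + 1100, 4096 + 1200, 4096 + 1300]:
    touch(addr)

print("hot micro-pages and their co-located slots:", remap)
```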

  9. Scheduling Policies
  • The memory controller must manage several timing constraints and issue a command when all resources are available
  • It must also maximize row buffer hit rates, fairness, and throughput
  • Reads are typically given priority over writes; the write buffer must be drained when it is close to full; changing the direction of the bus incurs a 5-10 ns delay
  • Basic policies: FCFS, First-Ready-FCFS (prioritize row buffer hits); a small FR-FCFS sketch follows
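
A minimal sketch of FR-FCFS request selection: among queued requests, prefer those that hit an open row, and break ties by age. The request fields and the open-row table are simplifying assumptions; a real controller also checks timing constraints and resource availability before issuing.

```python
# FR-FCFS selection sketch (request format and open-row table are assumed).
from dataclasses import dataclass

@dataclass
class Request:
    arrival: int
    bank: int
    row: int

def fr_fcfs(queue, open_rows):
    """Pick the next request: row-buffer hits first, then oldest (FCFS)."""
    if not queue:
        return None
    hits = [r for r in queue if open_rows.get(r.bank) == r.row]
    pool = hits if hits else queue
    return min(pool, key=lambda r: r.arrival)

queue = [Request(0, bank=0, row=5), Request(1, bank=1, row=9),
         Request(2, bank=0, row=7)]
open_rows = {0: 7, 1: 3}          # bank 0 currently has row 7 open
print(fr_fcfs(queue, open_rows))  # the row-buffer hit wins despite arriving last
```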

  10. STFM (Mutlu and Moscibroda, MICRO’07)
  • When multiple threads run together, threads with row buffer hits are prioritized by FR-FCFS
  • Each thread has a slowdown: S = T_shared / T_alone, where T is the number of cycles the ROB is stalled waiting for memory
  • Unfairness is estimated as S_max / S_min
  • If unfairness is higher than a threshold, thread priorities override other priorities (Stall Time Fair Memory scheduling); see the sketch below
  • Estimation of T_alone requires some book-keeping: does an access delay critical requests from other threads?
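
A small sketch of the STFM fairness check using the definitions above: per-thread slowdown S = T_shared / T_alone over memory-stall cycles, and unfairness = S_max / S_min. The stall-cycle numbers and the threshold are made-up values for illustration.

```python
# STFM fairness-check sketch (stall counts and threshold are assumptions).
def stfm_unfairness(stalls):
    """stalls: {thread: (t_alone, t_shared)} ROB memory-stall cycles."""
    slowdowns = {t: shared / alone for t, (alone, shared) in stalls.items()}
    return slowdowns, max(slowdowns.values()) / min(slowdowns.values())

stalls = {"A": (1000, 1500),   # mildly slowed thread, slowdown 1.5
          "B": (1000, 3200)}   # victim thread, slowdown 3.2
slowdowns, unfairness = stfm_unfairness(stalls)

THRESHOLD = 1.5                # assumed fairness threshold
print(slowdowns, f"unfairness = {unfairness:.2f}")
if unfairness > THRESHOLD:
    # STFM would now prioritize the most-slowed-down thread over
    # FR-FCFS's usual row-hit-first ordering.
    print("override: prioritize", max(slowdowns, key=slowdowns.get))
```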

  11. PAR-BS (Mutlu and Moscibroda, ISCA’08)
  • A batch of requests (per bank) is formed: each thread can only contribute R requests to this batch; batch requests have priority over non-batch requests
  • Within a batch, priority is first given to row buffer hits, then to threads with a higher “rank”, then to older requests
  • Rank is computed based on the thread’s memory intensity; low-intensity threads are given higher priority; this policy improves batch completion time
  • By using rank, requests from a thread are serviced in parallel; hence, parallelism-aware batch scheduling
  • (A small batching sketch follows.)
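
A sketch of batch formation and within-batch ordering. The cap R, the batched-first / row-hit / rank / age ordering follow the bullets above; the data structures, the example queue, and the rank values are invented for illustration.

```python
# PAR-BS sketch: batch formation plus prioritized selection (structures assumed).
from dataclasses import dataclass, field

R = 2   # assumed cap on requests one thread may contribute per bank to a batch

@dataclass
class Req:
    thread: str
    bank: int
    row: int
    arrival: int
    batched: bool = field(default=False)

def form_batch(queue):
    """Mark up to R oldest requests per (thread, bank) as the current batch."""
    taken = {}
    for req in sorted(queue, key=lambda r: r.arrival):
        key = (req.thread, req.bank)
        if taken.get(key, 0) < R:
            req.batched = True
            taken[key] = taken.get(key, 0) + 1

def pick_next(queue, open_rows, rank):
    """Order: batched > row-buffer hit > higher rank (less intensive) > older."""
    def prio(r):
        return (r.batched,
                open_rows.get(r.bank) == r.row,
                rank[r.thread],
                -r.arrival)
    return max(queue, key=prio)

queue = [Req("A", 0, 5, 0), Req("A", 0, 5, 1), Req("A", 0, 9, 2),
         Req("B", 0, 7, 3)]
form_batch(queue)
rank = {"A": 1, "B": 2}          # B is less memory-intensive -> higher rank
print(pick_next(queue, {0: 9}, rank))
```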

  12. TCM (Kim et al., MICRO 2010)
  • Organize threads into latency-sensitive and bw-sensitive clusters based on memory intensity; the former gets higher priority
  • Within the bw-sensitive cluster, priority is based on rank
  • Rank is determined based on the “niceness” of a thread, and the rank is periodically shuffled with insertion shuffling or random shuffling (the former is used if there is a big gap in niceness)
  • Threads with low row buffer hit rates and high bank-level parallelism are considered “nice” to others (see the sketch below)
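
A sketch of TCM's two-step grouping, assuming per-thread statistics for memory intensity (misses per kilo-instruction), row-buffer hit rate, and bank-level parallelism. The intensity cutoff, the niceness formula, and the thread stats are illustrative assumptions, not the paper's exact equations.

```python
# TCM clustering and niceness ranking sketch (constants and stats are assumed).
def tcm_clusters(threads, intensity_cutoff=5.0):
    """Split threads into latency-sensitive and bandwidth-sensitive clusters."""
    latency = [t for t, s in threads.items() if s["mpki"] < intensity_cutoff]
    bandwidth = [t for t in threads if t not in latency]
    return latency, bandwidth

def niceness(stats):
    # Low row-buffer locality + high bank-level parallelism => "nice" thread.
    return stats["blp"] - stats["row_hit_rate"]

threads = {
    "A": {"mpki": 1.0,  "row_hit_rate": 0.9, "blp": 0.2},  # latency-sensitive
    "B": {"mpki": 20.0, "row_hit_rate": 0.8, "blp": 0.3},  # streaming, not nice
    "C": {"mpki": 25.0, "row_hit_rate": 0.2, "blp": 0.9},  # nice to others
}

lat, bw = tcm_clusters(threads)
ranked_bw = sorted(bw, key=lambda t: niceness(threads[t]), reverse=True)
print("latency cluster (highest priority):", lat)
print("bw cluster ranked by niceness:", ranked_bw)
```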
