1 / 24

IBM Power 7

IBM Power 7. Harshit Dokania (hdokani2 ) Dhashrath Raguraman (raguram2) Prasanna Giridhar (giridha2). System Overview. Performance 8 Cores 4 SMT threads/core 45nm technology – 567 sq mm 1.2B transistors Integrated L1,L2 and L3( local+global ) Cache Memory On-Chip eDRAM

markku
Download Presentation

IBM Power 7

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. IBM Power 7 HarshitDokania (hdokani2 ) DhashrathRaguraman (raguram2) PrasannaGiridhar (giridha2)

  2. System Overview • Performance • 8 Cores • 4 SMT threads/core • 45nm technology – 567 sq mm • 1.2B transistors • Integrated L1,L2 and L3(local+global) Cache • Memory • On-Chip eDRAM • 2 DDR 3 Memory Controllers • Micro Architecture • Fetch, Decode, Dispatch, Rename, Execute and Commit • 12 execution units/core • 2 integer and 4 binary floating-point units • 1 vector and 1 decimal floating-point unit • 2 load/store, 1 branch, 1 condition register

  3. Instruction Fetch • 8 Instructions per cycle • I-Cache • 32 KB, 4 way, 128 B cache line • Tournament Branch Predictor Tag Address Translation I Cache 32KB 4 way, 16 banks 128 B line 21bit GHV 8 Instructions Hash Addr Cache Hit/Miss

  4. Dispatch and Rename • Dispatch a group of upto 6 instructions • Physical Register Renaming • Global Completion Table

  5. Issue • Different issue queues • Unified issue queue • Branch issue queue • Unified Issue Queue • Partitioned across threads • Complexity vs. Performance • Issue upto 8 instructions per cycle BRQ UQ0 UQ1 Branch Ex Unit FX1 LS1 VS1 FX0 LS0 VS0

  6. Pipeline Commit DE IF EX MEM

  7. Simultaneous Multi Threading and Mode Selection • Idea: • A single thread will be unable to keep all these resources busy • Modes • ST: Single instruction execution thread per core • SMT2: Two instruction execution threads per core • SMT4: Four instruction execution threads per core • An 8-core Power7 can execute 32 threads simultaneous • Dynamic SMT mode selection based on throughput comparison in single thread mode vs SMT mode

  8. S80 HW Multi-thread Single thread Out of Order FX0 FX0 FX1 FX1 FP0 FP0 FP1 FP1 LS0 LS0 LS1 LS1 BRX BRX CRL CRL POWER5 2 Way SMT POWER7 4 Way SMT FX0 FX0 FX1 FX1 FP0 FP0 FP1 FP1 LS0 LS0 LS1 LS1 BRX BRX CRL CRL Thread 0 Executing Thread 1 Executing No Thread Executing Comparison

  9. Partitioning in SMT • Non partitioned resources • Complexity and larger area • Power Consumption • In SMT4 • Thread 0 and Thread 1 use • {UQ0,GRP0,FXU0,LSU0} • Thread 2 and Thread 3 use • {UQ1,GRP1,FXU1,LSU1} • In ST,SMT2 • The GPR store identical contents • Functional(FX,LS) units are shared T0, T2 T1, T3 GPR0 GPR1

  10. Cache Hierarchy • L1 and L2 latency reduced • L3 on chip offers low latency and high bandwidth

  11. L1 Cache • 32KB, 8 way associativity, 128B cache lines • Increased bandwidth with concurrent reads and writes • Divided into banks • Each bank has 2 read ports and 1 write port • Follows write through policy, all stores go to L2 directly

  12. L2 Cache • All stores from L1 at low latency • 256KB, 8 way associativity • L3 local (Partial Victim Cache) management policy • determines exclusive or inclusive L3: • Exclusive provides more capacity and associativity • Inclusive helps reduce energy

  13. L3 Cache • 32MB divided into 8 regions of 4MB each: • 8 way associativity, 128B cache lines • Each region coupled with L2 cache • L3 made of eDRAM: • eDRAM has more latency and cycle time than SRAM • But still less volume occupied and less standy • energy consumed • On chip for lower latency and higher bandwidth

  14. L3 Cache Usage Other L3 can also use as victim cache L2 uses local L3 as the victim cache Higher weightage given to associated L2 compared to other L3

  15. L2 Cast out Check if there is a copy of line. If not allocate a line

  16. Lateral Cast out Check if there is a copy of line If not decide to allocate line based on capacity

  17. L3 replacement policy • All lines of L3 divided into 2 classes: • Victim lines due to cast out by associated L2 • Victim lines due to lateral cast out by other L3 • Lines are evicted in the following order: • LRU from class 2 • If no such line exists, then LRU from class 1

  18. Memory Subsystem Total Sustained ~ 100 GB/s MC0 MC1 Buffer Chips … DDR3 DRAM Chips …

  19. On-Chip Interconnect Coherence Bus 4x16B Data 4x16B Data Coherence Bus

  20. (RAS) Reliability, Availability, Serviceability • Stores previous architecture states. • On Error detection rollback. • Recovery unit is not required like POWER6 • Register renaming mechanism used. • Flushes all the instruction in pipeline • Fence the core from the rest of the system • Run an automatic built-in self-test

  21. Energy Management • While having 4 times as many cores on the chip, POWER7 chip is in the same power envelope as POWER6. • Reasons • Reduced frequency (From 2 cores on 3.6 ~ 5 GHz to 8 cores on 2.4–4.25 GHz ) • Microarchitecture optimization(For example, combining GPR, FPR and VR) • On chip eDRAM (one fifth compared to equivalent SRAM)

  22. Power Saving Techniques • Turn off threads in a core Remaining threads share the core execution resources • Turn off each core individually • cache, eDRAM reallocation • Better Core level performance • Per core frequency control • Operating requirements • Clock gating

  23. Questions

  24. THANK YOU

More Related