CPRE 585 Term Review


  1. CPRE 585 Term Review
     Performance evaluation, ISA design, dynamically scheduled pipeline, and memory hierarchy

  2. Exam Schedule
     In-class (40 points)
     • Wednesday Nov. 12, class time (75 minutes)
     Take-home (60 points)
     • Distributed Wednesday Nov. 12; return by 5:00pm Friday
     The two exams are not the same: they have different purposes, question types, and difficulty levels

  3. Performance Evaluation
     • Performance metrics: latency, throughput, and others
     • Speedup
     • Benchmarks: design considerations, categories, examples (SPEC and TPC)
     • Summarizing performance
     • Amdahl's Law: idea and equation (see the equations below)
     • CPU time equation
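     For quick reference, here are these equations in their standard textbook forms (the notation below is the conventional one, not copied from the slides):

     \[ \text{Speedup} = \frac{\text{Time}_{\text{baseline}}}{\text{Time}_{\text{enhanced}}}, \qquad \text{Speedup}_{\text{overall}} = \frac{1}{(1-f) + f/s} \quad \text{(Amdahl's Law)} \]

     where \(f\) is the fraction of execution time affected by an enhancement and \(s\) is the speedup of that fraction. The CPU time equation is

     \[ \text{CPU time} = \#\text{inst} \times \text{CPI} \times \text{cycle time}. \]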

  4. ISA Design
     • ISA types
     • GPR ISA variants: number of operands; use of register, immediate, and memory operands
     • GPR ISA design issues
       • memory addressing
       • endianness and alignment (see the sketch below)
     • Compare RISC and CISC
     • ISA impact on processor performance
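     On endianness, here is a minimal C sketch (my own example, not from the slides) that observes a machine's byte order at run time:

     /* Store a known 32-bit pattern and inspect its first byte in memory:
        a little-endian machine stores the least significant byte first. */
     #include <stdio.h>
     #include <stdint.h>

     int main(void) {
         uint32_t x = 0x01020304;
         uint8_t *p = (uint8_t *)&x;
         printf("%s-endian\n", p[0] == 0x04 ? "little" : "big");
         return 0;
     }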

  5. Instruction Scheduling Fundamentals
     • Dependence analysis
       • Data and name (anti- and output) dependences; or RAW, WAW, and WAR (illustrated below)
       • Dependences through registers and memory
       • Control dependence
     • \( \text{CPU time} = \#\text{inst} \times \text{CPI} \times \text{cycle time} \), where \( \text{CPI} = \text{CPI}_{\text{ideal}} + \text{CPI}_{\text{data hazard}} + \text{CPI}_{\text{control hazard}} \)
     • Deep pipeline: reduces cycle time
     • Multi-issue and dynamic scheduling: reduce CPI_ideal
     • Branch prediction and speculative execution: reduce CPI_control hazard
     • Memory hierarchy: reduces CPI_data hazard
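     To make the three dependence types concrete, here is a small C illustration of my own (the "registers" r1..r9 are just variables):

     #include <stdio.h>

     int main(void) {
         int r2 = 1, r3 = 2, r5 = 3, r6 = 4, r7 = 5, r8 = 6, r9 = 7;
         int r1, r4;
         r1 = r2 + r3;   /* I1                                            */
         r4 = r1 + r5;   /* I2: RAW on r1 (true data dependence on I1)    */
         r2 = r6 + r7;   /* I3: WAR on r2 (anti-dependence: I1 reads r2)  */
         r4 = r8 + r9;   /* I4: WAW on r4 (output dependence with I2)     */
         printf("%d %d %d\n", r1, r2, r4);
         return 0;
     }

     Renaming can remove the WAR and WAW (name) dependences, but the RAW dependence must be preserved.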

  6. Tomasulo Algorithm
     Study focus: data and name dependences through registers
     Hardware structures
     • Register status table (renaming table): helps remove name dependences and build up data dependences
     • Reservation station: preserves data dependences, buffers instruction state, wakes up dependent instructions (sketched below)
     • Common data bus: broadcasts tag and data
     What are the stages of Tomasulo? Understand the big example!
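     As a memory aid, here is a C sketch of a classic reservation-station entry and the CDB wake-up; the field names follow the usual textbook convention (Op, Vj/Vk, Qj/Qk, A), but the code itself is my own illustration:

     #include <stdint.h>

     typedef struct {
         int      busy;    /* entry in use?                                 */
         int      op;      /* operation to perform                          */
         uint8_t  Qj, Qk;  /* tags of producing RS entries; 0 = value ready */
         int32_t  Vj, Vk;  /* operand values, valid when matching Q is 0    */
         uint32_t A;       /* immediate / effective address (loads, stores) */
     } RSEntry;

     /* On a common-data-bus broadcast of (tag, value), every waiting
        entry snoops the bus and captures a matching operand. */
     void cdb_snoop(RSEntry *rs, int n, uint8_t tag, int32_t value) {
         for (int i = 0; i < n; i++) {
             if (!rs[i].busy) continue;
             if (rs[i].Qj == tag) { rs[i].Vj = value; rs[i].Qj = 0; }
             if (rs[i].Qk == tag) { rs[i].Vk = value; rs[i].Qk = 0; }
         }
     }

     An entry wakes up (becomes ready to issue to its function unit) once both Qj and Qk are 0.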

  7. Precise Interrupt and Speculative Execution
     • What is a precise interrupt, and why?
     • In-order commit: a solution for both
     • Central idea: maintain architectural state
       • Must buffer instruction output after execution
       • Commit instruction output to architectural state in program order
       • Flush the pipeline at exceptions or mis-speculations
     Q: What is the ROB? And its structure? (see the sketch below)
     Q: What changes in the pipeline?
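     A hedged C sketch of a reorder-buffer entry and in-order commit (field names are illustrative, not from the course materials). The ROB is a circular FIFO: an entry is allocated at dispatch, and only the head may update architectural state:

     #include <stdint.h>

     typedef struct {
         int      valid;      /* entry allocated?                        */
         int      done;       /* finished execution?                     */
         int      exception;  /* raised an exception?                    */
         uint8_t  dest_reg;   /* architectural destination register      */
         int32_t  value;      /* buffered result, not yet architectural  */
         uint32_t pc;         /* needed to report a precise exception    */
     } ROBEntry;

     /* Commit the head entry in program order; returns the new head
        index (wrap-around omitted for brevity). */
     int commit_head(ROBEntry *rob, int head, int32_t *arch_regs) {
         if (!rob[head].valid || !rob[head].done) return head;  /* stall */
         if (rob[head].exception) {
             /* flush the pipeline and vector to the handler (not shown) */
             return head;
         }
         arch_regs[rob[head].dest_reg] = rob[head].value;
         rob[head].valid = 0;
         return head + 1;
     }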

  8. Modern Instruction Scheduling
     Major differences: more pipeline stages, data forwarding, decoupled tag broadcasting, may use an issue queue
     Issue queue: RS changed to IQ; two pipeline stages are swapped; significant changes to registers, renaming, and the ROB
     • Why data forwarding?
     • How is the IQ different from the RS? What changes in the pipeline?
     • What is a physical register? What changes at the rename stage? (see the rename sketch below)
     Understand the generic superscalar processor models
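     A minimal C sketch of renaming with a physical register file (names and sizes are my own illustration): every new definition of an architectural register allocates a fresh physical register, which is what removes WAR and WAW name dependences:

     #include <stdint.h>

     #define NUM_ARCH 32
     #define NUM_PHYS 128

     typedef struct {
         uint8_t map[NUM_ARCH];        /* arch reg -> current phys reg */
         uint8_t free_list[NUM_PHYS];  /* stack of free physical regs  */
         int     free_count;
     } RenameTable;

     /* Rename "rd = rs1 op rs2": sources read the current mapping, the
        destination pops a fresh physical register. The old mapping of rd
        is reclaimed later, at commit (not shown). Returns -1 to stall
        when no physical register is free. */
     int rename(RenameTable *rt, int rd, int rs1, int rs2,
                uint8_t *ps1, uint8_t *ps2, uint8_t *pd) {
         if (rt->free_count == 0) return -1;
         *ps1 = rt->map[rs1];
         *ps2 = rt->map[rs2];
         *pd  = rt->free_list[--rt->free_count];
         rt->map[rd] = *pd;
         return 0;
     }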

  9. Branch Prediction
     Objective: deliver instructions continuously
     Several functions: predict target, direction, and return address
     • Review BTB and BHT design
     • Why use a saturating counter? (see the sketch below)
     • Why use correlating prediction?
     • How are the BTB and BHT updated?
     • How to calculate the mis-prediction penalty?
     • What is the return address stack?
     Understand the tournament predictor
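     The classic 2-bit saturating counter, sketched in C (my own code; the 0..3 state encoding is the standard one): two consecutive mispredictions are needed to flip the prediction, so a single anomalous direction, such as a loop exit, does not retrain the predictor:

     #include <stdint.h>

     /* States 0,1 predict not-taken; states 2,3 predict taken. */
     static inline int predict_taken(uint8_t ctr) { return ctr >= 2; }

     /* Update after the branch resolves, saturating at 0 and 3. */
     static inline uint8_t bht_update(uint8_t ctr, int taken) {
         if (taken) return ctr < 3 ? (uint8_t)(ctr + 1) : 3;
         else       return ctr > 0 ? (uint8_t)(ctr - 1) : 0;
     }

     A BHT is then just an array of these counters indexed by low-order PC bits.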

  10. Memory Data Flow Techniques
      Address dependences through memory: store->load dependences
      • Must buffer store outputs => store queue
      • Want memory-level parallelism => memory disambiguation
        • load bypassing and forwarding (sketched below)
        • may speculate if the store address is not known
      • Need to detect mis-speculation => load queue (violations detected on stores)
      Q: Where is the performance gain?
      Q: What are the structures of the LQ and SQ?
      Q: How are the store queue and load queue synchronized with the ROB?
      Q: Which portion of the SQ preserves architectural state? How to flush the SQ and LQ?
      Superscalar techniques: instruction flow, register flow, and memory data flow
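      A hedged C sketch of store-to-load forwarding (the structure and policy are my own simplification): a load scans older stores in the store queue from youngest to oldest; on an address match with ready data it forwards without touching the cache, and it conservatively waits when an older store's address is still unknown:

      #include <stdint.h>

      typedef struct {
          uint32_t addr;
          int32_t  data;
          int      addr_valid;   /* address computed? */
          int      data_valid;   /* data available?   */
      } SQEntry;

      /* sq[0..n-1] are stores older than the load, index n-1 youngest.
         Returns 1 and sets *out when the load is forwarded from the SQ,
         0 when it must wait or go to the cache. */
      int sq_forward(const SQEntry *sq, int n, uint32_t load_addr, int32_t *out) {
          for (int i = n - 1; i >= 0; i--) {
              if (!sq[i].addr_valid) return 0;     /* unknown address: wait
                                                      (or speculate + verify) */
              if (sq[i].addr == load_addr) {
                  if (!sq[i].data_valid) return 0; /* match, data not ready */
                  *out = sq[i].data;
                  return 1;
              }
          }
          return 0;  /* no older match: read from the cache */
      }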

  11. Limits of ILP
      • What may limit ILP in realistic programs?
      • What is the strategy for evaluating ILP limits?

  12. Cache Fundamentals
      Cache design
      • What is a cache? And why use a cache?
      • What are the 4 Qs of cache design?
      • Note that caching happens on memory blocks
      • Be very familiar with the cache address mapping format (see the sketch below)
      Cache performance
      • Three factors: miss rate, miss penalty, and hit time
      • What is AMAT? And memory stall time?
      • What is the final measurement of cache performance?
      • How to evaluate set-associative caches?
      • Know how to analyze memory access patterns
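      For the address mapping, recall AMAT = hit time + miss rate × miss penalty, and here is a small C sketch with illustrative parameters (a 32 KB direct-mapped cache with 64-byte blocks, i.e. 512 sets: 6 offset bits, 9 index bits, tag in the remaining high bits):

      #include <stdio.h>
      #include <stdint.h>

      #define OFFSET_BITS 6   /* 64-byte blocks */
      #define INDEX_BITS  9   /* 512 sets       */

      int main(void) {
          uint32_t addr   = 0x12345678;
          uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);
          uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
          uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);
          printf("tag=0x%x index=%u offset=%u\n", tag, index, offset);
          return 0;
      }

      For a set-associative cache the same split applies, with fewer index bits at the same capacity.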

  13. Cache Optimization Techniques
      • What is desired: low miss rate, fast cache hits, and small miss penalty, with minimal complexity (an ideal world)
      • Understand cache misses
        • What are the three Cs?
        • Which techniques reduce each type?
      • Optimization involves tradeoffs
        • e.g., cache size, block size, set associativity

  14. Improving Cache Performance
      Reducing miss rate
      • larger block size
      • larger cache size
      • higher associativity
      • way prediction
      • pseudo-associativity
      • compiler optimization
      Reducing miss penalty
      • multilevel caches
      • critical word first
      • read miss first
      • merging write buffers
      • victim caches
      Reducing miss penalty or miss rate via parallelism
      • non-blocking caches
      • hardware prefetching
      • compiler prefetching
      Reducing cache hit time
      • small and simple caches
      • avoiding address translation
      • pipelined cache access
      • trace caches
      Bold type: know details; others: understand concepts

  15. Virtual Memory
      • Why VM? What are the four Qs of VM design?
      • How to compare cache and VM?
      • Be familiar with the VM address mapping format
      • Understand the flat page table; what is in a PTE?
      • What is the TLB? How does the TLB work?
      • Why a multi-level page table? (see the sketch below)
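      On the last question, a multi-level table avoids allocating the whole flat table for sparse address spaces. Here is a hedged C sketch of a two-level walk (the 10+10+12 split of a 32-bit virtual address with 4 KB pages follows the classic x86 layout; the directory is simplified to an array of pointers):

      #include <stdint.h>
      #include <stddef.h>

      #define PTE_PRESENT 0x1u

      typedef uint32_t pte_t;  /* frame base in high bits + flag bits */

      /* Returns the physical address, or 0 to signal a page fault. */
      uint32_t translate(pte_t *page_dir[1024], uint32_t vaddr) {
          uint32_t dir_idx = (vaddr >> 22) & 0x3FF;  /* top 10 bits    */
          uint32_t tbl_idx = (vaddr >> 12) & 0x3FF;  /* middle 10 bits */
          uint32_t offset  =  vaddr        & 0xFFF;  /* low 12 bits    */

          pte_t *table = page_dir[dir_idx];
          if (table == NULL) return 0;               /* fault, level 1 */

          pte_t pte = table[tbl_idx];
          if (!(pte & PTE_PRESENT)) return 0;        /* fault, level 2 */

          return (pte & ~0xFFFu) | offset;           /* frame | offset */
      }

      A TLB simply caches recent (virtual page, PTE) pairs so most translations skip this walk.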

  16. Typical Memory Hierarchy Today
      • L1 instruction cache: small; combined with prediction (way prediction, trace cache) and prefetching (e.g., stream buffer); virtually indexed and virtually tagged
      • L1 data cache:
        • small and fast, pipelined, and likely set associative (Intel: 8KB, 4-way set associative)
        • virtually indexed and physically tagged
        • write-through
      • TLB: small (128-entry in the 21264), split for instructions and data, tends to be fully associative; the D-TLB runs in parallel with the L1 data cache
      • L2 unified cache: as large as the transistor budget allows; today highly set associative (e.g., 512KB 8-way for the P4); write-back to reduce memory traffic
      • Optional L3 cache: even larger, off-chip
      • Page table: multi-level, software-managed (21264) or hardware-managed (Intel)
      • Main memory: large but slow, high bandwidth
