
KILO-INSTRUCTION PROCESSORS

KILO-INSTRUCTION PROCESSORS. Arzucan Özgür, Department of Computer Engineering, Boğaziçi University. 15.12.2005, Cmpe 511.


Presentation Transcript


  1. KILO-INSTRUCTION PROCESSORS Arzucan Özgür Department of Computer Engineering Boğaziçi University 15.12.2005 Cmpe 511

  2. Introduction

  3. Memory Wall • Performance improvements of high-frequency microprocessors are seriously limited by main memory access latencies • [Figure: relative performance (axis from 1 to 1000) versus time (1980 to 2000). CPU performance grows 60%/yr (“Moore’s Law”), RAM performance grows 7%/yr, and the processor-memory performance gap grows 50%/yr.]

  4. Reducing Memory Latency

  5. Cache memory hierarchies • First-level (L1) cache is built into the processor core • It takes 1-3 processor clock cycles to access • On a miss in the L1 cache, the on-chip L2 cache is accessed, in the order of 10 processor cycles • Accessing main memory takes at least in the order of 100 processor cycles • Prefetching data from memory to the cache • Prefetch addresses are hard to predict • [Figure: processor pipeline stages (Next IP, Fetch, Drive, Alloc., Rename, Queue, Schedule, Dispatch, Reg. Read, Execute, Flags, Br. chk, Drive) shown with the L1 instruction cache, L1 data cache, L2 and memory, labelled "Branch misprediction".]
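
The latencies on this slide can be combined into an average memory access time. The following is a minimal sketch, not part of the presentation: the latencies follow the slide's orders of magnitude (L1 about 2 cycles, L2 about 10, main memory about 100), while the miss rates and the function name `average_access_cycles` are illustrative assumptions.

```python
def average_access_cycles(l1_hit, l2_hit, mem, l1_miss_rate, l2_miss_rate):
    """Average memory access time, in cycles, for a two-level cache hierarchy."""
    # Every access pays the L1 latency. L1 misses also pay the L2 latency,
    # and accesses that miss in both levels also pay the main memory latency.
    return l1_hit + l1_miss_rate * (l2_hit + l2_miss_rate * mem)


# Latencies follow the slide; both miss rates are assumed for illustration.
amat = average_access_cycles(l1_hit=2, l2_hit=10, mem=100,
                             l1_miss_rate=0.05, l2_miss_rate=0.2)
print(f"average access time: {amat:.2f} cycles")
```

Even with these fairly low assumed miss rates, the main memory term contributes a large share of the average, which is why the slides treat memory latency as the limiting factor.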

  6. Out-of-order superscalar processors

  7. Sequence of instructions containing data cache misses

  8. Kilo-Instruction Processors

  9. Definition • An out-of-order superscalar processor that supports thousands of “in-flight instructions” • Intelligent use of resources

  10. Scalability • Thousands of in-flight instructions and in-order commit make designs impractical: • ROB: Needs to maintain a copy of every in-flight instruction • IQs: Instructions depending on long-latency instructions remain in these queues for a long time • LSQs: Instructions remain in the queue until commit • Registers: A new physical register for each instruction producing a new value • We would like to get the IPC of thousands of instructions in flight without drastically increasing resource requirements
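
A back-of-envelope estimate shows why the window has to reach thousands of instructions. As a rough sketch, assume the processor must keep fetching at full width for the whole duration of one main memory miss; the number of instructions in flight is then about the issue width times the miss latency. The 4-wide issue width, the latencies listed, and the function name are assumptions, not figures from the presentation.

```python
def in_flight_needed(issue_width, miss_latency_cycles):
    """Rough number of in-flight instructions needed to cover one miss."""
    # While the miss is outstanding, issue_width new instructions enter
    # the window every cycle and none of them can commit in order.
    return issue_width * miss_latency_cycles


# Assumed 4-wide processor; latencies span "order of 100" cycles and beyond.
for latency in (100, 500, 1000):
    print(latency, "cycles ->", in_flight_needed(4, latency), "instructions")
```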

  11. Efficient Kilo-Instruction Processor Design • Multi-Checkpointing the ROB • Out-of-Order Commit • Early Release of Resources • Ephemeral Registers • Load Queues

  12. Checkpointing

  13. Checkpointing • The ROB allows the correct state to be restored at any instruction, which is not necessary • Checkpoint: a snapshot of the processor state taken at a specific instruction of the program being executed (processor state is checkpointed only for a subset of instructions) • With this snapshot the processor can restore state to that point in case of an exception or misprediction
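
The snapshot-and-restore idea can be sketched in a few lines. This is a simplified model under stated assumptions, not the design from the presentation: processor state is reduced to a dictionary of register values, instructions are identified by a program-order sequence number, and the class and method names are hypothetical.

```python
class CheckpointedState:
    """Toy processor state with checkpoints at selected instructions."""

    def __init__(self, registers):
        self.registers = dict(registers)
        self.checkpoints = []  # (sequence number, snapshot), oldest first

    def take_checkpoint(self, seq):
        # Copy the state so that later writes do not alter the snapshot.
        self.checkpoints.append((seq, dict(self.registers)))

    def rollback(self, faulting_seq):
        """Restore the newest checkpoint taken at or before faulting_seq."""
        while self.checkpoints and self.checkpoints[-1][0] > faulting_seq:
            self.checkpoints.pop()  # discard checkpoints younger than the fault
        if not self.checkpoints:
            raise RuntimeError("no checkpoint covers this instruction")
        seq, snapshot = self.checkpoints[-1]
        self.registers = dict(snapshot)
        return seq  # execution resumes here and replays up to the fault


state = CheckpointedState({"r1": 0, "r2": 0})
state.take_checkpoint(seq=0)
state.registers["r1"] = 5      # instruction 1 writes r1
state.take_checkpoint(seq=2)
state.registers["r2"] = 7      # instruction 3 writes r2, then faults
resume = state.rollback(faulting_seq=3)
print(resume, state.registers)
```

Unlike a ROB, the model cannot restore the state at instruction 3 directly: it falls back to the checkpoint at instruction 2 and would have to re-execute from there.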

  14. Design Decisions • How many in-flight checkpoints should be maintained by the processor? • A large number of checkpoints reduces the penalty of the recovery process • A large number of checkpoints increases the implementation cost • What kind of instructions should be checkpointed? • A checkpoint can be taken at any instruction • Some instructions are better candidates (e.g., some current processors take checkpoints at branch instructions in order to minimize the branch misprediction penalty) • How much information should be kept by each checkpoint?

  15. Multicheckpointing

  16. Selective Checkpointing • Replace the ROB with a pseudo-ROB • The processor removes instructions that reach the pseudo-ROB’s head at a fixed rate • Processor state is recoverable for any instruction in the pseudo-ROB • A checkpoint is taken when an incomplete instruction leaves the pseudo-ROB
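
The pseudo-ROB policy can be sketched as a small FIFO model. This is a simplification under stated assumptions: the pseudo-ROB has a fixed capacity, one instruction leaves the head each time a new one arrives at a full queue (standing in for the fixed removal rate), and each instruction's completion status is given up front rather than changing over time. The function name and the example program are hypothetical.

```python
from collections import deque


def checkpoints_taken(program, capacity):
    """Return the instructions at which a checkpoint is taken.

    program: list of (name, completed) pairs in program order.
    """
    pseudo_rob = deque()
    checkpoints = []
    for instruction in program:
        if len(pseudo_rob) == capacity:
            name, completed = pseudo_rob.popleft()  # leaves at the head
            if not completed:
                # State is no longer recoverable from the pseudo-ROB for
                # this instruction, so a checkpoint is taken at it.
                checkpoints.append(name)
        pseudo_rob.append(instruction)
    return checkpoints


program = [("load A", False), ("add", True), ("load B", False),
           ("sub", True), ("mul", True)]
taken = checkpoints_taken(program, capacity=2)
print(taken)
```

Completed instructions leave the pseudo-ROB without any cost; only the two assumed long-latency loads trigger checkpoints in this example.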

  17. Instruction Queue Management

  18. Bi-level Issue Queue • The processor detects instructions that will hold an issue queue entry for a long time • It removes these instructions from the primary issue queue • It offloads them to a slow-lane instruction queue, which is larger, slower and less complex • The same principle is applied to the load-store queue
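
One way to picture the detection step is to track which registers depend on an outstanding cache miss. The sketch below is an assumption about how such a filter could look, not the mechanism from the presentation: loads that miss are given as a known set, dependence is propagated through destination registers, and the function name and example program are hypothetical.

```python
def split_issue_queues(program, missing_loads):
    """Split instructions between the primary queue and the slow lane.

    program: list of (name, dest, sources) tuples in program order.
    missing_loads: names of loads assumed to miss in the cache.
    """
    slow_regs = set()              # registers whose value waits on a miss
    primary_queue, slow_lane = [], []
    for name, dest, sources in program:
        if any(src in slow_regs for src in sources):
            slow_lane.append(name)   # would block a primary queue entry
            slow_regs.add(dest)      # its result is late as well
        else:
            primary_queue.append(name)
            if name in missing_loads:
                slow_regs.add(dest)      # the load issues, its data is late
            else:
                slow_regs.discard(dest)  # dest is overwritten with a ready value
    return primary_queue, slow_lane


program = [("ld", "r1", ["r9"]), ("add", "r2", ["r1", "r3"]),
           ("sub", "r4", ["r5", "r6"]), ("mul", "r7", ["r2", "r4"])]
primary, slow = split_issue_queues(program, missing_loads={"ld"})
print(primary, slow)
```

Here `sub` is independent of the miss and stays in the primary queue, while `add` and `mul` move to the slow lane because they depend on the load, directly or transitively.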

  19. Physical Register File

  20. Ephemeral Registers • A conventional superscalar processor assigns physical registers to architected registers when an instruction enters the issue queue • An instruction reserves a physical register for its entire flight time • A physical register is not written a value until much later, so until then its primary function is tracking data dependencies • Use virtual registers: late register allocation • Release a register when no other instruction will read its data: early release
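
Late allocation and early release can be combined in a toy register file. This is a minimal sketch under stated assumptions: each instruction gets a virtual tag at rename, a physical register is taken only at writeback, and the number of readers is assumed to be known at rename so the register can be freed after its last read. The class and its methods are hypothetical.

```python
class LateAllocRegisterFile:
    """Toy register file with virtual tags, late allocation, early release."""

    def __init__(self, num_physical):
        self.free = list(range(num_physical))
        self.values = {}    # physical register -> value
        self.mapping = {}   # virtual tag -> physical register
        self.readers = {}   # virtual tag -> reads still pending
        self.next_tag = 0

    def rename(self, pending_readers):
        # Only a virtual tag is handed out; it tracks dependencies
        # without occupying a physical register.
        tag = self.next_tag
        self.next_tag += 1
        self.readers[tag] = pending_readers
        return tag

    def writeback(self, tag, value):
        phys = self.free.pop()  # late allocation: the value exists now
        self.mapping[tag] = phys
        self.values[phys] = value

    def read(self, tag):
        phys = self.mapping[tag]
        value = self.values[phys]
        self.readers[tag] -= 1
        if self.readers[tag] == 0:  # early release after the last reader
            del self.mapping[tag], self.values[phys], self.readers[tag]
            self.free.append(phys)
        return value


rf = LateAllocRegisterFile(num_physical=1)
t0 = rf.rename(pending_readers=1)
t1 = rf.rename(pending_readers=1)  # two in flight, one physical register
rf.writeback(t0, 10)
v0 = rf.read(t0)                   # last read frees the register
rf.writeback(t1, 20)
v1 = rf.read(t1)
print(v0, v1, rf.free)
```

Two instructions are in flight at once although only one physical register exists, which a conventional scheme that reserves a register at rename could not support.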

  21. Performance Evaluation

  22. Kilo-Instruction Multiprocessors

  23. Ideal Network

  24. References • Adrian Cristal, Oliverio J. Santana, Francisco Cazorla, Marco Galluzzi, Tanausu Ramirez, Miquel Pericas, Mateo Valero. "Kilo-Instruction Processors: Overcoming the Memory Wall," IEEE Micro, vol. 25, no. 3, pp. 48-57, May/June 2005. • A. Cristal, O. Santana, M. Valero, J.F. Martínez. "Toward Kilo-Instruction Processors," ACM Trans. on Architecture and Code Optimization, vol. 1, no. 4, Dec. 2004. • Marco Galluzzi, Valentin Puente, Adrián Cristal, Ramón Beivide, José-Ángel Gregorio, Mateo Valero. "A First Glance at Kilo-Instruction Based Multiprocessors," Conf. Computing Frontiers 2004, pp. 212-221.

  25. Thank you!
