CRE652 Processor Architecture


  1. CRE652 Processor Architecture
     Course Objective: To gain
     • (1) knowledge of the current issues in processor architectures, and
     • (2) skills for performing architecture research
     Ref. Text
     • J. Smith and G. Sohi, “The Microarchitecture of Superscalar Processors”, Proceedings of the IEEE, 1995.
     • Papers from ISCA, MICRO, and ICCD
     • Computer Architecture: A Quantitative Approach, Hennessy and Patterson, Morgan Kaufmann.

  2. Superscalar Processor Model
     [Figure: block diagram of a superscalar processor. Labels: Reg. File, rename, D-Cache, CT, WB, ROB, I-Cache, BTB, IF, IS, I Buffer, DP, Dispatch (scheduler), reservation station, Rename, Instr. window, Function units]
     • VLIW – EPIC
     • SMT

  3. Memory Access Flow
     [Figure: memory access flow. Labels: Page table pointer register, Virtual address (from program counter or Load/Store instruction), I-TLB, D-TLB, Page table, Entry with Dirty = 1, Cache, Memory, Processor]

  4. Walls: Limits in performance
     • ILP Wall
     • Memory Wall
     • Power Wall

  5. ILP (Instruction Level Parallelism)
     Fundamental limitation: data flow dependency
     Practical limiting factors
     • Instruction Window Size: Branch Prediction
     • Data dependency: Register Renaming
     • Memory-address Alias: Memory Disambiguation
     • (Resource Conflicts)
     • (Memory Latency due to cache misses and lack of ports)

  6. ILP (Instruction Level Parallelism)
     With no limiting factors, i.e. an infinite window, infinite renaming registers, perfect branch prediction, and all memory addresses exactly known, the average ILP in programs is known to be quite high. With realistic limiting factors, however, IPC becomes fairly restricted.
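The idealized limit described above can be sketched as the number of instructions divided by the length of the longest true-dependence chain. This is a minimal illustration, not the method used in the cited studies: the trace format `(dest, [srcs])`, the unit latency, and the function name `dataflow_ilp` are all assumptions made for this example.

```python
def dataflow_ilp(trace):
    """Upper-bound ILP of a trace under idealized assumptions.

    trace: list of (dest, [srcs]) register names in program order (hypothetical format).
    Only true (RAW) dependences are honored; every instruction takes one cycle.
    """
    ready = {}   # register -> cycle in which its most recent value is produced
    depth = 0    # length of the longest dependence chain seen so far
    for dest, srcs in trace:
        # An instruction executes one cycle after its last source is produced.
        cycle = 1 + max((ready.get(s, 0) for s in srcs), default=0)
        ready[dest] = cycle
        depth = max(depth, cycle)
    return len(trace) / depth if trace else 0.0


trace = [
    ("r1", []),            # r1 = const
    ("r2", []),            # r2 = const
    ("r3", ["r1", "r2"]),  # r3 = r1 + r2
    ("r4", ["r3"]),        # r4 = r3 * 2
    ("r5", ["r1"]),        # r5 = r1 - 1
    ("r6", []),            # r6 = const
]
print(dataflow_ilp(trace))  # 6 instructions / chain of depth 3 -> 2.0
```

Because each reader looks up the latest producer of a register, reuse of a register name never delays an instruction, which corresponds to the "infinite renaming registers" assumption.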

  7. ILP Limit
     • Foster and Riseman, “Percolation of Code to Enhance Parallel Dispatching and Execution”, IEEE Trans. Computers, Vol. C-21, Dec. 1972.

     No. of branches bypassed    ILP
     0 (basic block)             1.72
     1                           2.72
     2                           3.61
     8                           7.21
     36                          14.8
     128                         24.4
     ∞                           51.2

  8. ILP Limit
     • SPEC92 (H&P-Text Fig. 3.1, p. 157): ILP = 17.9 for li to 150.1 for tomcatv
     • M. A. Postiff, “The Limits of ILP in SPEC95 Applications”, INTERACT-3, ACM Computer Architecture News, Vol. 27, No. 1, Mar. 1999
       With no memory aliasing: 19.62 for li to 3933.03 for mgrid (61.47 for tomcatv)
       With stack dependency (for allocating activation records) removed: 81.45 for li to 4003.44 for mgrid

  9. ILP due to practical limiting factors
     Limiting Factors (H&P-Text p. 152 – 170):
     • Instruction Window Size: more instructions to consider, better ILP potential
     • Branch Prediction Accuracy: fewer wasted cycles
     • Renaming Registers: more registers, better chance to remove WAR and WAW
     • Memory Aliasing: more accurate memory dependency
     • Resources: matching the available function unit types to the ILP

  10. ILP due to practical limiting factors
      Limiting Factor – Instruction Window Size
      Instruction window:
      • set of instructions examined for simultaneous execution (reservation station + current fetch)
      • max. no. of comparisons: no. of completing instructions × no. of instructions waiting to be issued × 2 (assuming at most two source operands per instruction)
      • with a typical window size of 64 to 128, this comparison is time-critical
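The comparison count on this slide is simple arithmetic, shown here as a small sketch. The function name and the example values for the number of completing instructions per cycle (4 and 8) are assumptions chosen for illustration; only the window sizes 64 and 128 come from the slide.

```python
def wakeup_comparisons(completing, waiting, sources_per_instr=2):
    """Maximum tag comparisons per cycle: each completing result is compared
    against every source operand of every instruction waiting to be issued."""
    return completing * waiting * sources_per_instr


# Assumed completion rates (4 and 8 per cycle), window sizes from the slide.
print(wakeup_comparisons(completing=4, waiting=64))   # 512
print(wakeup_comparisons(completing=8, waiting=128))  # 2048
```

Doubling both the completion rate and the window size quadruples the comparator count, which is one way to see why this logic becomes time-critical.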

  11. ILP due to practical limiting factors
      Limiting Factor – Instruction Window Size
      e.g. (from H&P-Text Fig. 3.2, p. 159) ILP vs. window size
      note:
      1. effects of window size
      2. inefficiency of larger window
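The effect of window size can be explored with a toy model. The sketch below is an assumption-laden simplification, not the simulator behind the H&P figure: instructions enter a window of `window` slots in program order, a slot is freed the cycle after its occupant issues, issue width is unlimited, and latency is one cycle. The trace format `(dest, [srcs])` and the name `windowed_ilp` are hypothetical.

```python
import heapq


def windowed_ilp(trace, window):
    """ILP of a trace when only `window` (>= 1) instructions can be in flight.

    trace: list of (dest, [srcs]) register names in program order.
    """
    ready = {}       # register -> cycle in which its most recent value is produced
    in_window = []   # min-heap of issue cycles of instructions holding a slot
    depth = 0
    for dest, srcs in trace:
        enter = 1
        if len(in_window) == window:
            # Window full: wait until the earliest-issuing occupant frees its slot.
            enter = heapq.heappop(in_window) + 1
        cycle = max(enter, 1 + max((ready.get(s, 0) for s in srcs), default=0))
        ready[dest] = cycle
        heapq.heappush(in_window, cycle)
        depth = max(depth, cycle)
    return len(trace) / depth if trace else 0.0


# Eight mutually independent instructions: ILP is capped by the window size.
independent = [(f"r{i}", []) for i in range(8)]
for w in (1, 2, 8):
    print(w, windowed_ilp(independent, w))  # 1 1.0, then 2 2.0, then 8 8.0
```

With a dependent chain instead of independent instructions, enlarging the window changes nothing, which illustrates why larger windows show diminishing returns.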

  12. ILP due to practical limiting factors
      Limiting Factor – Branch Prediction
      e.g. (from H&P-Text Fig. 3.3, p. 160) ILP vs. branch prediction
      note:
      • perf: perfect branch prediction
      • comb: tournament predictor
      • bi: bimodal predictor (2-bit counter)
      • stat: static prediction with profiling
      • none: no prediction
      note:
      • instruction window size: 2K
      • issue limit: 64
      • jump prediction with 2K-entry table
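As an illustration of the "bi" scheme, here is a minimal sketch of a bimodal predictor built from 2-bit saturating counters. The table size, the PC-indexing function, the initial counter state, and the class and function names are assumptions for this example and are not taken from the H&P configuration.

```python
class BimodalPredictor:
    """Table of 2-bit saturating counters indexed by the branch address."""

    def __init__(self, entries=2048):  # table size is an assumed value
        self.entries = entries
        self.counters = [1] * entries  # 0-1 predict not taken, 2-3 predict taken

    def _index(self, pc):
        return (pc >> 2) % self.entries  # assumes 4-byte aligned instructions

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def accuracy(predictor, pc, outcomes):
    """Fraction of outcomes predicted correctly, training after each branch."""
    hits = 0
    for taken in outcomes:
        hits += predictor.predict(pc) == taken
        predictor.update(pc, taken)
    return hits / len(outcomes)


# A loop branch taken 9 times and then not taken, repeated 10 times.
loop_branch = ([True] * 9 + [False]) * 10
print(accuracy(BimodalPredictor(), 0x400100, loop_branch))  # 0.89
```

The 2-bit hysteresis means the single loop exit costs one misprediction per loop instance rather than two, apart from the warm-up miss on the very first branch.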

  13. ILP due to practical limiting factors
      Limiting Factor – Renaming Registers
      e.g. (from H&P-Text Fig. 3.5, p. 163) ILP vs. additional rename registers
      note:
      • instruction window size: 2K
      • issue limit: 64
      • combining predictor with a total of 8K entries
      • jump prediction with 2K-entry table
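To show why rename registers remove WAR and WAW hazards, the following sketch gives every destination a fresh physical register. It assumes an unlimited supply of physical registers (real hardware has a finite pool, which is the limiting factor on this slide); the trace format `(dest, [srcs])` and the `p0, p1, ...` names are hypothetical.

```python
from itertools import count


def rename(trace):
    """Rewrite a trace so that every write targets a fresh physical register.

    trace: list of (dest, [srcs]) architectural register names in program order.
    """
    fresh = count()
    mapping = {}  # architectural register -> current physical register
    renamed = []
    for dest, srcs in trace:
        # Sources are looked up before the destination is remapped.
        new_srcs = [mapping.get(s, s) for s in srcs]
        mapping[dest] = f"p{next(fresh)}"
        renamed.append((mapping[dest], new_srcs))
    return renamed


trace = [
    ("r1", ["r2", "r3"]),  # r1 = r2 + r3
    ("r4", ["r1"]),        # r4 = r1 * 2   (RAW on r1)
    ("r1", ["r5"]),        # r1 = r5 - 1   (WAR with line 2, WAW with line 1)
    ("r6", ["r1"]),        # r6 = r1 + 4   (RAW on the new r1)
]
for instr in rename(trace):
    print(instr)
# ('p0', ['r2', 'r3'])
# ('p1', ['p0'])
# ('p2', ['r5'])
# ('p3', ['p2'])
```

After renaming, the third instruction no longer conflicts with the first two, so only the true RAW chains p0 to p1 and p2 to p3 remain.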

  14. ILP due to practical limiting factors
      Limiting Factor – Memory Aliasing
      e.g.
        ld $3, #200($4)
        st $5, #200($6)
      How can we be sure about the dependency between the two memory locations ($4)+200 and ($6)+200?
      • Perfect – after executing program
      • Global reference and Stack references
        • Global data region
        • Stack access for local variables (activation records)
        • Unknown, i.e. assume conflicts, for heap region for dynamic data structures
      • Inspection – compile time region analysis
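The aliasing question for the ld/st pair can be sketched as an address comparison that falls back to "assume conflict" when a base register is not yet known. This is only an illustration of conservative disambiguation: the function name, the 4-byte access size, and the register values are assumptions.

```python
def may_conflict(load, store, regs, size=4):
    """Return True if the two accesses overlap, or if an address is unknown.

    load, store: (base_register, offset) pairs.
    regs: mapping from register name to its currently known value.
    size: access width in bytes (assumed to be 4 here).
    """
    addresses = []
    for base, offset in (load, store):
        if base not in regs:
            return True  # unknown address: conservatively assume a conflict
        addresses.append(regs[base] + offset)
    return abs(addresses[0] - addresses[1]) < size


ld = ("$4", 200)  # ld $3, #200($4)
st = ("$6", 200)  # st $5, #200($6)
print(may_conflict(ld, st, {"$4": 0x1000, "$6": 0x1000}))  # True: same address
print(may_conflict(ld, st, {"$4": 0x1000, "$6": 0x2000}))  # False: disjoint
print(may_conflict(ld, st, {"$4": 0x1000}))                # True: $6 unknown
```

The third case is the costly one for ILP: the two accesses may well be independent, but the hardware must serialize them until the store address resolves.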

  15. ILP due to practical limiting factors
      Limiting Factor – Memory Aliasing
      e.g. (from H&P-Text Fig. 3.6, p. 164) ILP vs. aliasing detection schemes
      • P: perfect alias resolution
      • G/S: global/stack
      • Ins: inspection
      note:
      • instruction window size: 2K
      • issue limit: 64, with 256 registers
      • combining predictor with a total of 8K entries
      • jump prediction with 2K-entry table

  16. ILP Limit
      A Realizable Superscalar Processor: H&P-Text Sec. 3.3, with rather realistic assumptions
      • 64-issue with no issue restrictions
      • Tournament predictor with 1K entries
      • 16-entry jump return predictor
      • 256-entry instruction window
      • No alias within window
      • 64 additional renaming registers
      note: having no issue restrictions is virtually impossible even for a lower issue count, say 16.

  17. ILP Limit – Realistic Processor around 25%

  18. ILP Limit – Realistic Processor
      • ILP potential in software
      • ILP limited by resources
        • Window size
        • Function unit mismatch
        • Registers
      • ILP limited by dependency
        • Branch prediction
        • False dependency
        • Output dependency (WAW)
        • Data dependency (RAW)

  19. Processor Architecture Comparison (H&P-Text Sec.3.6)

  20. Performance on SPECint2000

  21. Performance on SPECfp2000

  22. Normalized Performance: Efficiency

  23. Superscalar processor
      N-way Superscalar:
      • Fetch and decode N instructions
      • N “ready” instructions “issued” to function units
        fetch, decode, renaming, dispatch, issue, execution, writeback/commit
      • After issue, execution begins
      • The maximum number of instructions a processor can issue simultaneously is the “issue width”.
      • Actual issue rate is much less
      • Fetch = Decode > Issue = Execute > Commit

  24. Note: Can we keep going down the superscalar path for better performance?
      • Increase instruction window, issue width, data path width
        → wire delay becomes a more important factor
        → clustered organization may help: frequent intra-cluster operations, infrequent inter-cluster operations
      • Simpler may be better? But it does not fully utilize the available on-chip resources
        Adapting a multiprocessor approach? How to control multiprocessors for multiple instructions

  25. Note: Removing dependency limit
      1. Current practice/convention of the programming model imposes unnecessary dependency
         • WAR and WAW through memory
           • because of the way stack frames are allocated and deallocated, a procedure may reuse memory locations that a previous procedure used on the stack
         • specific use of registers
           • loop counter, return address register, stack pointer
      2. Going beyond the data-flow limit
         • Data value prediction with speculation
           general value prediction: unlikely
           • address value prediction
           • constant/loop index value prediction
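Loop-index and address values are predictable because they tend to advance by a constant stride. A minimal sketch of a stride value predictor follows; the class name, the unbounded per-PC table, and the update policy are assumptions made for illustration rather than a description of any specific published design.

```python
class StridePredictor:
    """Predicts the next value produced by an instruction as last value + stride."""

    def __init__(self):
        self.table = {}  # pc -> (last_value, stride); unbounded here for simplicity

    def predict(self, pc):
        if pc not in self.table:
            return None  # no history yet, so no prediction is made
        last, stride = self.table[pc]
        return last + stride

    def update(self, pc, value):
        if pc in self.table:
            last, _ = self.table[pc]
            self.table[pc] = (value, value - last)
        else:
            self.table[pc] = (value, 0)


predictor = StridePredictor()
hits = 0
for value in [0, 4, 8, 12, 16]:        # e.g. an array index advancing by 4
    hits += predictor.predict(0x400200) == value
    predictor.update(0x400200, value)
print(hits)  # 3: correct once the stride has been observed
```

A consumer that speculates on the predicted value can start before its producer finishes, which is how value prediction goes beyond the data-flow limit; a wrong prediction requires recovery, just as with a branch misprediction.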

  26. Dealing with Other Walls
      • Memory Wall
        • Faster multilevel cache
        • Non-blocking pipelined cache
        • Cache in multicore processor
        • Transactional memory
      • Power Wall
        • Lower driving voltage
        • Allowing errors
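The benefit of a lower driving voltage follows from the usual first-order approximation of CMOS dynamic power. The symbols below (activity factor $\alpha$, switched capacitance $C$, supply voltage $V_{dd}$, clock frequency $f$) and the example voltages are not in the slides and are introduced here only for illustration.

```latex
P_{\text{dyn}} \approx \alpha \, C \, V_{dd}^{2} \, f
\qquad\Longrightarrow\qquad
\frac{P_{\text{dyn}}(V_{dd} = 0.8\,\text{V})}{P_{\text{dyn}}(V_{dd} = 1.0\,\text{V})}
  = \left(\frac{0.8}{1.0}\right)^{2} = 0.64
\quad \text{(at fixed } \alpha,\ C,\ f\text{)}
```

Under this approximation, a 20% reduction in supply voltage cuts dynamic power by about 36%. Operating closer to the voltage margin is also what makes "allowing errors" relevant as a companion technique.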

  27. Adding New Functionality
      • Network and I/O related
        • Bypassing OS intervention
      • Multimedia
        • Vector instructions
      • Trusted Computing
        • Trusted Platform Module