CRE652 Processor Architecture
Course Objective: To gain
• (1) knowledge of the current issues in processor architectures, and
• (2) skills for performing architecture research
Ref. Text
• J. Smith and G. Sohi, “The Microarchitecture of Superscalar Processors”, Proceedings of the IEEE, 1995.
• Papers from ISCA, MICRO, and ICCD
• Computer Architecture: A Quantitative Approach, Hennessy and Patterson, Morgan Kaufmann.

Superscalar Processor Model
[Figure: superscalar pipeline block diagram. Stage labels: IF, DP, IS, WB, CT. Block labels: I-Cache, BTB, I Buffer, Rename, Dispatch (scheduler), reservation station / instr. window, function units, Reg. File, ROB, D-Cache]
• VLIW – EPIC
• SMT

Memory Access Flow
[Figure: memory access flow diagram. Labels: virtual address from program counter or load/store instruction, I-TLB, D-TLB, page table pointer register, page table, entry with Dirty = 1, Processor, Cache, Memory]

Walls: Limits on performance
• ILP Wall
• Memory Wall
• Power Wall

ILP (Instruction Level Parallelism)
Fundamental limitation: data flow dependency
Practical limiting factors
• Instruction window size → branch prediction
• Data dependency → register renaming
• Memory-address alias → memory disambiguation
• (Resource conflicts)
• (Memory latency due to cache misses and lack of ports)

ILP (Instruction Level Parallelism)
With no limiting factors, i.e. an infinite window, infinite renaming registers, perfect branch prediction, and all memory addresses exactly known, the average ILP in programs is known to be quite high. With realistic limiting factors, however, IPC becomes fairly restricted.
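The idealized limit described above can be illustrated with a small sketch. It assumes unit-latency instructions, an unlimited window, and unlimited renaming, so only true (RAW) dependencies constrain the schedule. The function name `dataflow_ilp` and the tuple encoding of instructions are illustrative choices, not taken from the course material.

```python
def dataflow_ilp(instrs):
    """Return instruction count divided by dataflow critical-path length.

    instrs: list of (dest_reg, [source_regs]) in program order.
    Assumptions: unit latency, unlimited window and renaming (only RAW
    dependencies constrain the schedule), no branches or memory aliasing.
    """
    ready = {}   # register name -> cycle in which its newest value is produced
    depth = 0    # length of the longest dependency chain seen so far
    for dest, srcs in instrs:
        # An instruction can execute one cycle after its latest producer.
        cycle = 1 + max((ready.get(s, 0) for s in srcs), default=0)
        ready[dest] = cycle
        depth = max(depth, cycle)
    return len(instrs) / depth if depth else 0.0


# Hypothetical instruction sequences used only for illustration.
independent = [("r1", ["r0"]), ("r2", ["r0"]), ("r3", ["r1", "r2"]), ("r4", ["r0"])]
chain = [("r1", ["r0"]), ("r1", ["r1"]), ("r1", ["r1"])]

print(dataflow_ilp(independent))  # 4 instructions over a 2-cycle critical path
print(dataflow_ilp(chain))        # fully serial chain
```

Adding a finite window, imperfect branch prediction, or limited renaming to such a model can only lengthen the schedule, which is the gap between the ideal and the realistic numbers.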

ILP Limit
• Foster and Riseman, “Percolation of Code to Enhance Parallel Dispatching and Execution”, IEEE Trans. Computers, Vol. C-21, Dec. 1972.
No. of branches bypassed    ILP
0 (basic block)             1.72
1                           2.72
2                           3.61
8                           7.21
36                          14.8
128                         24.4
∞                           51.2

ILP Limit
• SPEC92, H&P-Text Fig. 3.1, p. 157: ILP = 17.9 for li to 150.1 for tomcatv
• M. A. Postiff, “The Limits of ILP in SPEC95 Applications”, INTERACT-3, ACM Computer Architecture News, Vol. 27, No. 1, Mar. 1999
  • With no memory aliasing: 19.62 for li to 3933.03 for mgrid (61.47 for tomcatv)
  • With stack dependency (for allocating activation records) removed: 81.45 for li to 4003.44 for mgrid

ILP due to practical limiting factors
Limiting Factors: (H&P-Text p. 152 – 170)
• Instruction Window Size: more instructions to consider, better ILP potential
• Branch Prediction Accuracy: fewer wasted cycles
• Renaming Registers: more registers, better chance to remove WAR and WAW
• Memory Aliasing: more accurate memory dependency
• Resources: function unit types matched to the available ILP

ILP due to practical limiting factors
Limiting Factor - Instruction Window Size
Instruction Window:
• set of instructions examined for simultaneous execution
  - reservation station + current fetch
• max. no. of comparisons: no. of completing instructions × no. of instructions waiting to be issued × 2 (assuming at most two source operands per instruction)
• with a typical window size of 64 to 128, this comparison is time-critical
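The comparison count above is a simple product, and a short sketch shows how quickly it grows. The function name and the sample numbers of completing instructions are assumptions chosen for illustration; only the formula and the 64 to 128 window sizes come from the slide.

```python
def wakeup_comparisons(completing, waiting, sources_per_instr=2):
    """Upper bound on tag comparisons per cycle in the instruction window.

    Each completing instruction's result tag is compared against every
    source operand of every instruction still waiting to be issued.
    """
    return completing * waiting * sources_per_instr


# Illustrative values: 4 or 8 results completing per cycle (assumed).
for completing in (4, 8):
    for waiting in (64, 128):
        print(completing, waiting, wakeup_comparisons(completing, waiting))
```

Because all of these comparisons must finish within a cycle for back-to-back issue of dependent instructions, the count scales with both issue width and window size.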

ILP due to practical limiting factors
Limiting Factor - Instruction Window Size
e.g. (from H&P-Text Fig. 3.2 p. 159) ILP vs. window size
note:
1. effects of window size
2. inefficiency of larger window

ILP due to practical limiting factors
Limiting Factor – Branch Prediction
e.g. (from H&P-Text Fig. 3.3 p. 160) ILP vs. branch prediction
note:
• perf: perfect branch prediction
• comb: tournament predictor
• bi: bimodal predictor (2-bit counter)
• stat: static prediction with profiling
• none: no prediction
note:
• instruction window size: 2K
• issue limit: 64
• jmp prediction with 2K entry table
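As an illustration of the bimodal (2-bit counter) scheme listed above, here is a minimal sketch. The class name, the table size, the PC indexing, and the initial counter state are assumptions made for the example and are not the configuration used in the H&P figure.

```python
class BimodalPredictor:
    """Table of 2-bit saturating counters indexed by branch address."""

    def __init__(self, entries=2048):
        self.entries = entries
        self.counters = [1] * entries  # assumed start state: weakly not-taken

    def _index(self, pc):
        # Assumes 4-byte instructions, so the low two PC bits are dropped.
        return (pc >> 2) % self.entries

    def predict(self, pc):
        # Counter values 2 and 3 predict taken; 0 and 1 predict not-taken.
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def accuracy(predictor, trace):
    """Fraction of branches in trace [(pc, taken), ...] predicted correctly."""
    correct = 0
    for pc, taken in trace:
        correct += predictor.predict(pc) == taken
        predictor.update(pc, taken)
    return correct / len(trace)


# Hypothetical trace: a loop branch taken 9 times, then not taken, run 10 times.
loop_trace = ([(0x400, True)] * 9 + [(0x400, False)]) * 10
print(accuracy(BimodalPredictor(), loop_trace))
```

The second counter bit means that a single loop exit does not flip the prediction, so after warm-up the loop branch is mispredicted only once per loop execution.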

ILP due to practical limiting factors
Limiting Factor – Renaming Registers
e.g. (from H&P-Text Fig. 3.5 p. 163) ILP vs. additional rename registers
note:
• instruction window size: 2K
• issue limit: 64
• combining predictor of total 8K entry
• jmp prediction with 2K entry table
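To show why extra rename registers remove WAR and WAW hazards, the following sketch gives every destination a fresh physical register. It assumes an unbounded supply of physical registers, whereas the figure studies a finite number of them. The function name and instruction encoding are illustrative.

```python
from itertools import count


def rename(instrs):
    """Rename (dest, [sources]) instructions so each write gets a new register.

    Assumption: physical registers never run out. A real design stalls
    when its free list is empty.
    """
    fresh = count()   # generator of physical register numbers
    mapping = {}      # architectural register -> current physical register
    renamed = []
    for dest, srcs in instrs:
        new_srcs = []
        for s in srcs:
            if s not in mapping:            # live-in value: allocate on first use
                mapping[s] = f"p{next(fresh)}"
            new_srcs.append(mapping[s])
        mapping[dest] = f"p{next(fresh)}"   # new name breaks WAR and WAW
        renamed.append((mapping[dest], new_srcs))
    return renamed


# Hypothetical sequence with a WAR on r2 and a WAW on r1.
program = [
    ("r1", ["r2", "r3"]),  # I1
    ("r2", ["r4"]),        # I2: WAR with I1 on r2
    ("r1", ["r2", "r5"]),  # I3: WAW with I1 on r1, RAW with I2 on r2
]
for instr in rename(program):
    print(instr)
```

After renaming, only the RAW dependence from I2 to I3 remains, so I1 and I2 can execute in either order.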

ILP due to practical limiting factors
Limiting Factor – Memory Aliasing
e.g.
ld $3, #200($4)
st $5, #200($6)
how to be sure about the dependency between the two memory locations ($4)+200 and ($6)+200?
• Perfect: known after executing the program
• Global reference and Stack references
  • Global data region
  • Stack access for local variables (activation records)
  • Unknown, i.e. assume conflicts, for heap region for dynamic data structures
• Inspection: compile time region analysis
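The load/store question above can be sketched as a three-way classification. This is an illustrative model rather than any of the schemes in the H&P figure. It assumes same-size aligned accesses, and it assumes the base register is not modified between the two instructions when both use the same base.

```python
def classify_pair(load, store, known_regs):
    """Classify a (load, store) pair as conflicting, independent, or unknown.

    load, store: (base_register, offset) tuples.
    known_regs: dict of register -> value for registers already computed.
    """
    load_base, load_off = load
    store_base, store_off = store
    if load_base in known_regs and store_base in known_regs:
        # Both effective addresses are computable: compare them directly.
        same = known_regs[load_base] + load_off == known_regs[store_base] + store_off
        return "conflict" if same else "independent"
    if load_base == store_base:
        # Same (unmodified) base register: offsets alone decide.
        return "conflict" if load_off == store_off else "independent"
    # Addresses not yet known: a conservative scheme must assume a conflict.
    return "unknown"


ld = ("$4", 200)   # ld $3, #200($4)
st = ("$6", 200)   # st $5, #200($6)
print(classify_pair(ld, st, {}))                          # bases not yet known
print(classify_pair(ld, st, {"$4": 1000, "$6": 1000}))    # same address
print(classify_pair(ld, st, {"$4": 1000, "$6": 2000}))    # different addresses
```

The "unknown" case is what limits ILP: until both base registers are available, the later access cannot safely be reordered with the earlier one unless the hardware speculates.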

ILP due to practical limiting factors
Limiting Factor – Memory Aliasing
e.g. (from H&P-Text Fig. 3.6 p. 164) ILP vs. aliasing detection schemes
• P: perfect alias resolution
• G/S: global/stack
• Ins: inspection
note:
• instruction window size: 2K
• issue limit: 64 with 256 registers
• combining predictor of total 8K entry
• jmp prediction with 2K entry table

ILP Limit
A Realizable Superscalar Processor: H&P-Text sec. 3.3, with rather realistic assumptions
• 64-issue with no issue restrictions
• Tournament predictor with 1K entries
• 16-entry jump return predictor
• 256 instruction window
• No alias within window
• 64 additional renaming registers
note: issuing with no restrictions is virtually impossible even for a lower issue count, say 16.

ILP Limit – Realistic Processor around 25%

ILP Limit – Realistic Processor
• ILP potential in software
• ILP limited by resources
  • Window size
  • Function unit mismatch
  • Registers
• ILP limited by dependency
  • Branch prediction
  • False dependency
    • Output dependency (WAW)
  • Data dependency (RAW)

Processor Architecture Comparison (H&P-Text Sec.3.6)

Performance on SPECint2000

Performance on SPECfp2000

Normalized Performance: Efficiency

Superscalar processor
N-way Superscalar:
• Fetch and decode N instructions
• N “ready” instructions “issued” to function units
  fetch, decode, renaming, dispatch, issue, execution, writeback/commit
• After issue, execution begins
• The maximum number of instructions a processor can issue simultaneously is the “issue width”.
• Actual issue rate is much less
• Fetch = Decode > Issue = Execute > Commit

Note: Can we keep going down the superscalar path for better performance?
• Increase instruction window, issue width, data path width
  → wire delay becomes a more important factor
  → clustered organization may help: frequent intra-cluster operations, infrequent inter-cluster operations
• Simpler may be better? But it does not fully utilize the available on-chip resources
• Adapting a multiprocessor approach? How to control multiple processors for multiple instructions?

Note: Removing dependency limit
1. Current practice/convention of the programming model imposes unnecessary dependencies
• WAR and WAW through memory
  • because of the way stack frames are allocated and deallocated, a procedure may reuse memory locations that a previous procedure used on the stack
• specific use of registers
  • loop counter, return address register, stack pointer
2. Going beyond the data-flow limit
• Data value prediction with speculation
  • general value prediction: unlikely
  • address value prediction
  • constant/loop index value prediction
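Address and loop-index values often advance by a constant step, which is what makes them more predictable than general values. A minimal last-value-plus-stride sketch illustrates the idea. The class name and the sample value sequence are hypothetical, and a real predictor would be indexed by instruction address and use confidence counters.

```python
class StridePredictor:
    """Predict the next value as last value plus last observed stride."""

    def __init__(self):
        self.last = None   # most recent actual value, None before first use
        self.stride = 0    # difference between the two most recent values

    def predict(self):
        if self.last is None:
            return None    # nothing observed yet, no prediction
        return self.last + self.stride

    def update(self, value):
        if self.last is not None:
            self.stride = value - self.last
        self.last = value


def count_hits(values):
    """Count how many values in the sequence were predicted correctly."""
    predictor = StridePredictor()
    hits = 0
    for v in values:
        hits += predictor.predict() == v
        predictor.update(v)
    return hits


# Hypothetical load addresses from walking an array of 4-byte elements.
addresses = [100, 104, 108, 112, 116]
print(count_hits(addresses), "of", len(addresses))
```

The first two accesses are needed to learn the stride. After that, a dependent instruction could start speculatively with the predicted value and be squashed only on a misprediction.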

Dealing with Other Walls
• Memory Wall
  • Faster multilevel cache
  • Non-blocking pipelined cache
  • Cache in multicore processor
  • Transactional memory
• Power Wall
  • Lower driving voltage
  • Allowing errors

Adding New Functionality
• Network and I/O related
  • Bypassing OS intervention
• Multimedia
  • Vector instructions
• Trusted Computing
  • Trusted Platform Module
