CRE652 Processor Architecture

  • Course Objective: To gain

    (1) knowledge of the current issues in processor architectures, and

    (2) skills for performing architecture research

  • Reference Texts

    • J. Smith and G. Sohi, “The Microarchitecture of Superscalar Processors”, Proceedings of the IEEE, 1995.

    • Papers from ISCA, MICRO, and ICCD

    • Hennessy and Patterson, Computer Architecture: A Quantitative Approach, Morgan Kaufmann.


Superscalar Processor Model

[Figure: block diagram of the superscalar processor model. Labeled blocks: I-Cache, BTB, IF, I Buffer, Rename, DP, Dispatch (scheduler), reservation station, Instr. window, IS, Function units, D-Cache, Reg. File, ROB, WB, CT.]

  • VLIW – EPIC

  • SMT


Memory Access Flow

[Figure: memory access flow. Labels: Processor, virtual address (from the program counter or a load/store instruction), I-TLB, D-TLB, page table pointer register, page table (entry with Dirty = 1), Cache, Memory.]


Walls: limits on performance

  • ILP Wall

  • Memory Wall

  • Power Wall


ILP (Instruction-Level Parallelism)

Fundamental limitation: data-flow dependency

Practical limiting factors, each with the technique used to relax it:

  • Instruction window size: branch prediction

  • Data dependency: register renaming

  • Memory-address aliasing: memory disambiguation

  • (Resource conflicts)

  • (Memory latency due to cache misses and lack of ports)


ILP (Instruction-Level Parallelism)

With no limiting factors, i.e. an infinite window, infinite renaming registers, perfect branch prediction, and all memory addresses exactly known, the average ILP in programs is known to be quite high.

But with realistic limiting factors, IPC becomes fairly restricted.
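As a rough illustration of the case with no limiting factors, the sketch below computes the data-flow limit of a tiny trace: with unit latency and only true (RAW) dependences, the ILP is the instruction count divided by the length of the longest dependence chain. The trace format, the register names, and the function name `dataflow_ilp` are assumptions made for this example; they do not come from the course material.

```python
def dataflow_ilp(trace):
    """trace: list of (dest, [sources]) in program order, unit latency.

    Models an idealized machine (infinite window, perfect renaming and
    prediction), so only true RAW dependences delay an instruction.
    Returns (instruction count, critical path length, ILP).
    """
    ready = {}   # register -> cycle in which its latest value is produced
    depth = 0    # length of the longest dependence chain seen so far
    for dest, srcs in trace:
        # Execute one cycle after the last producer of any source operand.
        cycle = 1 + max((ready.get(s, 0) for s in srcs), default=0)
        ready[dest] = cycle
        depth = max(depth, cycle)
    return len(trace), depth, len(trace) / depth


# Hypothetical six-instruction trace; r0 is assumed to be available at the start.
example_trace = [
    ("r1", ["r0"]),
    ("r2", ["r0"]),
    ("r3", ["r1", "r2"]),
    ("r4", ["r0"]),
    ("r5", ["r3", "r4"]),
    ("r6", ["r0"]),
]

print(dataflow_ilp(example_trace))  # (6, 3, 2.0)
```

Real limit studies use traces of millions of instructions and realistic latencies; this toy version only shows where the "quite high" numbers come from.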


ILP Limit

  • Foster and Riseman, “Percolation of Code to Enhance Parallel Dispatching and Execution”, IEEE Trans. Computers, Vol. C-21, Dec. 1972.

    No. of branches bypassed    ILP
    0 (basic block)             1.72
    1                           2.72
    2                           3.61
    8                           7.21
    36                          14.8
    128                         24.4
    ∞                           51.2


ILP Limit

  • SPEC92 (H&P text, Fig. 3.1, p. 157)

    ILP ranges from 17.9 for li to 150.1 for tomcatv

  • M. A. Postiff, “The Limits of ILP in SPEC95 Applications”, INTERACT-3, ACM Computer Architecture News, Vol. 27, No. 1, Mar. 1999

    With no memory aliasing, ILP ranges from 19.62 for li to 3933.03 for mgrid (61.47 for tomcatv)

    With the stack dependency (from allocating activation records) removed, ILP ranges from 81.45 for li to 4003.44 for mgrid


ILP due to practical limiting factors

Limiting factors (H&P text, pp. 152–170):

  • Instruction window size

    more instructions to consider means better ILP potential

  • Branch prediction accuracy

    fewer wasted cycles

  • Renaming registers

    more registers give a better chance to remove WAR and WAW hazards

  • Memory aliasing

    more accurate memory dependency information

  • Resources

    the available function unit types must match the available ILP


ILP due to practical limiting factors

Limiting Factor – Instruction Window Size

Instruction window:

  • the set of instructions examined for simultaneous execution: reservation station entries plus the currently fetched instructions

  • max. no. of comparisons per cycle:

    no. of completing instructions × no. of instructions waiting to be issued × 2 (assuming at most two source operands per instruction)

  • with a typical window size of 64 to 128, these comparisons are time-critical
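The comparison count above is plain arithmetic, and a few lines make the growth visible. In this sketch the completion rate of 4 instructions per cycle is an illustrative assumption; the window sizes of 64 and 128 are the ones quoted on the slide, and the function name `wakeup_comparisons` is hypothetical.

```python
def wakeup_comparisons(completing, waiting, operands_per_instr=2):
    """Upper bound on tag comparisons needed in one cycle.

    Every completing instruction's result tag is compared against every
    source operand of every instruction still waiting in the window.
    """
    return completing * waiting * operands_per_instr


# Assumed completion rate: 4 instructions per cycle.
for window in (64, 128):
    print(window, wakeup_comparisons(completing=4, waiting=window))
# 64 512
# 128 1024
```

Doubling the window doubles the comparators that must all resolve within one cycle, which is why the slide calls this logic time-critical.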


ILP due to practical limiting factors

Limiting Factor – Instruction Window Size

[Figure: ILP vs. window size, from H&P text Fig. 3.2, p. 159]

Note:

1. the effect of window size on ILP

2. the inefficiency of larger windows
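To get a feel for how window size shapes ILP, here is a minimal sketch of a window-limited scheduler. It assumes unit latency, unlimited function units, perfect renaming, and a window that always holds the oldest not-yet-executed instructions; this is a simplification chosen for the example, not the model used for the H&P figure. The trace and the name `windowed_ilp` are hypothetical.

```python
def windowed_ilp(trace, window):
    """trace: list of (dest, [sources]) in program order. Returns ILP."""
    # Record, for each instruction, the earlier instructions it truly depends on.
    last_writer = {}
    producers = []
    for i, (dest, srcs) in enumerate(trace):
        producers.append([last_writer[s] for s in srcs if s in last_writer])
        last_writer[dest] = i

    done_cycle = [None] * len(trace)
    pending = list(range(len(trace)))  # unexecuted instructions, program order
    cycle = 0
    while pending:
        cycle += 1
        # Only the oldest `window` pending instructions are examined.
        candidates = pending[:window]
        issued = {
            i for i in candidates
            if all(done_cycle[p] is not None and done_cycle[p] < cycle
                   for p in producers[i])
        }
        for i in issued:
            done_cycle[i] = cycle
        pending = [i for i in pending if i not in issued]
    return len(trace) / cycle


# Two independent dependence chains placed one after the other.
two_chains = [
    ("a1", ["r0"]), ("a2", ["a1"]), ("a3", ["a2"]), ("a4", ["a3"]),
    ("b1", ["r0"]), ("b2", ["b1"]), ("b3", ["b2"]), ("b4", ["b3"]),
]

for w in (1, 4, 8):
    print(w, windowed_ilp(two_chains, w))
# 1 1.0
# 4 1.6
# 8 2.0
```

The second chain is only overlapped with the first once the window is large enough to reach it, which mirrors the first note on the slide.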


ILP due to practical limiting factors

Limiting Factor – Branch Prediction

[Figure: ILP vs. branch prediction scheme, from H&P text Fig. 3.3, p. 160]

Legend:

perf: perfect branch prediction

comb: tournament predictor

bi: bimodal predictor (2-bit counter)

stat: static prediction with profiling

none: no prediction

Assumptions:

instruction window size: 2K

issue limit: 64

jump prediction with a 2K-entry table
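The bimodal predictor (2-bit counter) in the list above can be sketched in a few lines. The table size, the initial counter value, the indexing by `pc % entries`, and the class and function names are assumptions for illustration only.

```python
class BimodalPredictor:
    """Table of 2-bit saturating counters indexed by the branch address."""

    def __init__(self, entries=2048):
        self.entries = entries
        # Counter values 0..3; start at 1 (weakly not-taken), an assumption.
        self.counters = [1] * entries

    def predict(self, pc):
        # Counter values 2 and 3 predict taken.
        return self.counters[pc % self.entries] >= 2

    def update(self, pc, taken):
        i = pc % self.entries
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def accuracy(predictor, outcomes, pc=0x40):
    """Fraction of correct predictions for one branch at a hypothetical pc."""
    correct = 0
    for taken in outcomes:
        correct += predictor.predict(pc) == taken
        predictor.update(pc, taken)
    return correct / len(outcomes)


# A loop branch: taken 9 times, then not taken on exit, repeated 10 times.
loop_branch = ([True] * 9 + [False]) * 10
print(accuracy(BimodalPredictor(), loop_branch))  # 0.89
```

After warm-up the counter mispredicts only the loop exit, so the hysteresis of the second bit avoids a second misprediction when the loop is re-entered.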


ILP due to practical limiting factors

Limiting Factor – Renaming Registers

[Figure: ILP vs. number of additional rename registers, from H&P text Fig. 3.5, p. 163]

Assumptions:

instruction window size: 2K

issue limit: 64

combining predictor with 8K entries in total

jump prediction with a 2K-entry table
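For readers who want to see why rename registers matter, the sketch below renames a short trace so that every write gets a fresh physical register, which removes WAR and WAW hazards while keeping RAW dependences. It assumes an unlimited supply of physical registers, unlike the finite counts studied in the figure; the trace format and the name `rename` are hypothetical.

```python
from itertools import count


def rename(trace):
    """trace: list of (dest, [sources]) using architectural register names.

    Returns the trace rewritten so each write targets a new physical register.
    """
    fresh = count()   # unlimited physical registers (assumption)
    mapping = {}      # architectural register -> current physical register
    renamed = []
    for dest, srcs in trace:
        # Sources read the latest mapping, so true (RAW) dependences survive.
        new_srcs = [mapping.get(s, s) for s in srcs]
        # A fresh destination removes WAR and WAW conflicts on `dest`.
        phys = f"p{next(fresh)}"
        mapping[dest] = phys
        renamed.append((phys, new_srcs))
    return renamed


# r1 is written twice: the second write is WAW with the first
# and WAR with the read of r1 in the second instruction.
trace = [
    ("r1", ["r2", "r3"]),
    ("r4", ["r1"]),
    ("r1", ["r5"]),
    ("r6", ["r1"]),
]
for instr in rename(trace):
    print(instr)
# ('p0', ['r2', 'r3'])
# ('p1', ['p0'])
# ('p2', ['r5'])
# ('p3', ['p2'])
```

After renaming, the third instruction no longer has to wait for the first two, since it writes `p2` rather than reusing the register they touch.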


ILP due to practical limiting factors

Limiting Factor – Memory Aliasing

e.g.  ld $3, #200($4)

      st $5, #200($6)

How can we be sure about the dependency between the two memory locations, ($4)+200 and ($6)+200?

  • Perfect: all addresses exactly known, which is possible only after executing the program

  • Global/Stack: perfect resolution for global and stack references

    • Global data region

    • Stack accesses for local variables (activation records)

    • Unknown, i.e. assume conflicts, for the heap region holding dynamic data structures

  • Inspection: compile-time region analysis
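The load/store pair above can be checked for a conflict once the base register values are known; when they are not, a conservative scheme has to assume a conflict. The sketch below illustrates both cases. The 4-byte access size, the register values, and the function names are assumptions made for this example.

```python
def accesses_overlap(addr_a, addr_b, size=4):
    """True if two size-byte accesses touch at least one common byte."""
    return addr_a < addr_b + size and addr_b < addr_a + size


def may_conflict(load, store, regs, size=4):
    """load/store: (base_register, offset); regs: known register values.

    Returns True when the accesses overlap, or when a base register value
    is not yet known and a conflict must be assumed.
    """
    (base_l, off_l), (base_s, off_s) = load, store
    if base_l not in regs or base_s not in regs:
        return True  # unresolved address: assume a conflict
    return accesses_overlap(regs[base_l] + off_l, regs[base_s] + off_s, size)


load = ("$4", 200)    # ld $3, #200($4)
store = ("$6", 200)   # st $5, #200($6)

print(may_conflict(load, store, {"$4": 0x1000, "$6": 0x1000}))  # True
print(may_conflict(load, store, {"$4": 0x1000, "$6": 0x2000}))  # False
print(may_conflict(load, store, {"$4": 0x1000}))                # True
```

The third call shows the cost of the conservative choice: the load is held back even though the two addresses might turn out to be unrelated.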


ILP due to practical limiting factors

Limiting Factor – Memory Aliasing

[Figure: ILP vs. alias detection scheme, from H&P text Fig. 3.6, p. 164]

Legend:

P: perfect alias resolution

G/S: global/stack

Ins: inspection

Assumptions:

instruction window size: 2K

issue limit: 64, with 256 registers

combining predictor with 8K entries in total

jump prediction with a 2K-entry table


ILP Limit

A Realizable Superscalar Processor:

H&P text, Sec. 3.3, with fairly realistic assumptions

  • 64-issue with no issue restrictions

  • Tournament predictor with 1K entries

  • 16-entry jump return predictor

  • 256-entry instruction window

  • No memory aliasing within the window

  • 64 additional renaming registers

    Note: issuing with no restrictions is virtually impossible even for a lower issue width, say 16.


ILP Limit – Realistic Processor

around 25%


ILP Limit – Realistic Processor

  • ILP potential in software

  • ILP limited by resources

    • Window size

    • Function unit mismatch

    • Registers

  • ILP limited by dependency

    • Branch prediction

    • False dependency

      • Output dependency (WAW)

      • Anti-dependency (WAR)

    • True data dependency (RAW)


Processor Architecture Comparison (H&P-Text Sec.3.6)


Performance on SPECint2000


Performance on SPECfp2000


Normalized Performance: Efficiency


Superscalar processor

N-way superscalar:

  • Fetch and decode N instructions per cycle

  • N “ready” instructions are “issued” to function units

    Pipeline stages: fetch, decode, renaming, dispatch, issue, execution, writeback/commit

    • After issue, execution begins

    • The maximum number of instructions a processor can issue simultaneously is the “issue width”.

    • The actual issue rate is much lower

  • Fetch = Decode > Issue = Execute > Commit


    Note: Can we keep following the superscalar path for better performance?

    • Increase the instruction window, issue width, and data path width

      → wire delay becomes a more important factor

      → a clustered organization may help:

      frequent intra-cluster operations,

      infrequent inter-cluster operations

    • Simpler may be better?

      But a simpler design does not fully utilize the available on-chip resources

      Adopt a multiprocessor approach?

      How to control multiple processors for multiple instructions?


    Note: Removing the dependency limit

    1. Current programming-model practice/convention imposes unnecessary dependencies

    • WAR and WAW through memory

      • because of the way stack frames are allocated and deallocated, a procedure may reuse memory locations that a previous procedure used on the stack

    • specific use of registers

      • loop counter, return address register, stack pointer

    2. Going beyond the data-flow limit

    • Data value prediction with speculation

      • general value prediction: unlikely

      • address value prediction

      • constant/loop index value prediction
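Address and loop-index values are predictable because they often change by a constant step. A minimal sketch of a stride-based value predictor follows; the table organization (one entry per instruction address holding the last value and last stride) and all names are assumptions for illustration, not a design taken from the slides.

```python
class StridePredictor:
    """Predicts the next value of an instruction as last value + last stride."""

    def __init__(self):
        self.table = {}  # pc -> (last_value, stride)

    def predict(self, pc):
        if pc not in self.table:
            return None  # no history yet, so no prediction is made
        last, stride = self.table[pc]
        return last + stride

    def update(self, pc, value):
        if pc in self.table:
            last, _ = self.table[pc]
            self.table[pc] = (value, value - last)
        else:
            self.table[pc] = (value, 0)


def count_hits(values, pc=0x80):
    """Number of correct predictions for one instruction at a hypothetical pc."""
    predictor = StridePredictor()
    hits = 0
    for v in values:
        hits += predictor.predict(pc) == v
        predictor.update(pc, v)
    return hits


# Addresses produced by walking an array of 4-byte elements.
print(count_hits([0, 4, 8, 12, 16]))  # 3
```

The first two values are needed to learn the stride; after that every address in the regular sequence is predicted correctly, so a dependent instruction could start speculatively before the address is actually computed.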


    Dealing with Other Walls

    • Memory Wall

      • Faster Multilevel Cache

      • Non-blocking pipelined cache

      • Cache in multicore processor

        • Transactional memory

    • Power Wall

      • Lower driving voltage

      • Allowing errors


    Adding New Functionality

    • Network and I/O related

      • Bypassing OS intervention

    • Multimedia

      • Vector instructions

    • Trusted Computing

      • Trusted Platform Module

