  1. CSE 502: Computer Architecture - Superscalar Decode

  2. Superscalar Decode for RISC ISAs
  • Decode X insns. per cycle (e.g., 4-wide)
  • Just duplicate the hardware
  • Instructions aligned at 32-bit boundaries
  [Figure: scalar 1-wide fetch (one 32-bit inst → one Decoder → one decoded inst) vs. 4-wide superscalar fetch (four 32-bit insts → four Decoders → four decoded insts).]
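
  A minimal C sketch of why duplication suffices: fixed-length, aligned insns. make each decode slot independent of its neighbors. decode_one() and its RISC-V-flavored field split are illustrative assumptions, not a real ISA definition.

      #include <stdint.h>

      #define FETCH_WIDTH 4

      typedef struct { uint8_t opcode, rd, rs1, rs2; } decoded_t;

      /* hypothetical single-inst decoder; the field split is made up */
      static decoded_t decode_one(uint32_t inst) {
          decoded_t d = {
              .opcode = inst & 0x7f,
              .rd     = (inst >> 7)  & 0x1f,
              .rs1    = (inst >> 15) & 0x1f,
              .rs2    = (inst >> 20) & 0x1f,
          };
          return d;
      }

      /* 4-wide decode: no slot needs anything from another slot, so in
       * hardware this loop is four copies of the decoder, side by side */
      void decode_group(const uint32_t group[FETCH_WIDTH],
                        decoded_t out[FETCH_WIDTH]) {
          for (int i = 0; i < FETCH_WIDTH; i++)
              out[i] = decode_one(group[i]);
      }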

  3. uop Limits
  • How many uops can a decoder generate?
  • For complex x86 insts, many are needed (10's, 100's?)
    • Makes the decoder horribly complex
  • Typically there's a limit to keep complexity under control
    • One x86 instruction → 1-4 uops
    • Most instructions translate to 1.5-2.0 uops
  • What if a complex insn. needs more than 4 uops?

  4. UROM/MS for Complex x86 Insts
  • UROM (microcode ROM) stores the uop equivalents
  • Used for nasty x86 instructions
    • Complex ones needing > 4 uops (e.g., PUSHA or REP.MOV)
    • Obsolete ones (like AAA)
  • Microsequencer (MS) handles the UROM interaction

  5. UROM/MS Example (3 uop-wide)
  • Complex instructions get their uops from the microcode sequencer
  [Figure: cycle-by-cycle example. Fetch delivers x86 insts (XOR, ADD, REP.MOV, ADD [ ], SUB, ...); in cycle 1 the decoders crack the simple insts directly into uops, then REP.MOV occupies decode while the UROM streams its uop expansion (LOAD, STORE, INC, STA, STD, mJCC, ...) across cycles 2..n, and normal decode resumes in cycle n+1.]
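
  A minimal C sketch of the fast-path/MS split. The 4-uop fast-path limit and 3-wide decode follow the slides; the REP.MOV uop expansion shown is illustrative, not real microcode.

      #include <stdio.h>

      #define MAX_FAST_UOPS 4   /* a decoder cracks at most 4 uops itself */
      #define DECODE_WIDTH  3   /* 3 uops per cycle, as in this example   */

      /* hypothetical UROM entry: the canned uop sequence for one inst */
      typedef struct { const char **uops; int n; } urom_entry_t;

      /* Microsequencer: stream the UROM sequence, DECODE_WIDTH uops per
       * cycle; the regular decoders sit idle while this runs. */
      static void microsequence(const urom_entry_t *e) {
          for (int i = 0; i < e->n; i += DECODE_WIDTH) {
              printf("cycle: ");
              for (int j = i; j < e->n && j < i + DECODE_WIDTH; j++)
                  printf("%s ", e->uops[j]);
              printf("\n");
          }
      }

      int main(void) {
          /* illustrative expansion only: two iterations of a REP.MOV
           * loop body -- load, store, count decrement, micro-branch */
          const char *rep_mov[] = { "LOAD", "STORE", "SUB", "mJCC",
                                    "LOAD", "STORE", "SUB", "mJCC" };
          urom_entry_t e = { rep_mov, 8 };
          if (e.n > MAX_FAST_UOPS)   /* too big for the fast path... */
              microsequence(&e);     /* ...so the MS takes over      */
          return 0;
      }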

  6. Superscalar CISC Decode
  • Instruction Length Decode (ILD)
    • Where are the instructions?
    • Limited decode – just enough to parse prefixes, modes
  • Shift/Alignment
    • Get the right bytes to the decoders
  • Decode
    • Crack into uops
  • Do this for N instructions per cycle!

  7. ILD Recurrence/Loop
      PC_i   = X
      PC_i+1 = PC_i + sizeof( Mem[PC_i] )
      PC_i+2 = PC_i+1 + sizeof( Mem[PC_i+1] )
             = PC_i + sizeof( Mem[PC_i] ) + sizeof( Mem[PC_i+1] )
  • Can't find the start of the next insn. before length-decoding the first
  • Must do ILD serially
  • ILD of 4 insns/cycle implies cycle time will be 4x
  • Critical component that is not pipeline-able
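
  The loop-carried dependence is easy to see in code. A minimal sketch; the toy length decoder stands in for real x86 ILD and handles only a few opcodes.

      #include <stddef.h>
      #include <stdint.h>

      /* Toy length decoder -- NOT a full x86 ILD. The cases are real
       * single-inst encodings: 0x90 = NOP (1 byte), 0x05 = ADD EAX,imm32
       * (5 bytes), 0x01 = ADD r/m32,r32 (2 bytes for a reg-reg modrm). */
      static size_t x86_length(const uint8_t *p) {
          switch (p[0]) {
          case 0x90: return 1;
          case 0x05: return 5;
          case 0x01: return 2;
          default:   return 1;   /* pretend everything else is 1 byte */
          }
      }

      /* Find the start offsets of the next n insts. pc for iteration
       * i+1 is unknown until x86_length() finishes for iteration i --
       * exactly the recurrence above, so the iterations cannot be
       * overlapped into independent work. */
      void ild_serial(const uint8_t *bytes, size_t start,
                      size_t starts[], int n) {
          size_t pc = start;
          for (int i = 0; i < n; i++) {
              starts[i] = pc;
              pc += x86_length(bytes + pc);   /* serial step */
          }
      }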

  8. Bad x86 Decode Implementation
  [Figure: 16 raw instruction bytes feed a chain of three ILD (limited decode) blocks; each produces a length that is added to locate the next inst, all within cycle 1. A left shifter then steers Inst 1, Inst 2, and Inst 3 (plus the remainder bytes) to Decoders 1-3 over cycles 2-3.]
  • ILD dominates cycle time; not scalable

  9. Hardware-Intensive Decode
  • Decode from every possible instruction starting point!
  [Figure: one ILD per byte position (16 ILDs for a 16-byte window) feeds giant MUXes that select the right instruction bytes for each of the three decoders.]

  10. ILD in Hardware-Intensive Approach
  [Figure: a decoder at every byte offset; in the example the resolved instruction lengths are 6, 4, and 3 bytes, and the slide notes "Total bytes decoded = 11".]
  • Previous: critical path = 3 × ILD + 2 × add
  • Now: critical path = 1 × ILD + 2 × (mux + add)
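
  A C sketch of the trade, reusing a toy length decoder: phase 1 has no dependences at all (all 16 ILDs run side by side in hardware), and phase 2 replaces each serial ILD on the critical path with a cheap add-and-select.

      #include <stddef.h>
      #include <stdint.h>

      #define WINDOW 16   /* bytes examined per cycle, as in the figure */

      /* toy length decoder (illustrative, not real x86) */
      static size_t toy_length(const uint8_t *p) {
          return p[0] == 0x05 ? 5 : p[0] == 0x01 ? 2 : 1;
      }

      /* Phase 1: a length at EVERY offset. No result depends on any
       * other, so hardware computes all 16 in parallel. */
      void ild_all_offsets(const uint8_t *bytes, size_t len[WINDOW]) {
          for (int i = 0; i < WINDOW; i++)
              len[i] = toy_length(bytes + i);
      }

      /* Phase 2: chase the chain of precomputed lengths. Each step is
       * an add plus a select (a mux in hardware), not another ILD. */
      int select_starts(const size_t len[WINDOW], size_t starts[], int max) {
          size_t pc = 0;
          int found = 0;
          while (found < max && pc < WINDOW) {
              starts[found++] = pc;
              pc += len[pc];
          }
          return found;
      }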

  11. Predecoding
  • ILD loop is hardware intensive
    • Impacts latency
    • Consumes substantial power
  • If instructions A, B, and C are decoded...
    • ... lengths for A, B, and C will still be the same next time
    • No need to repeat ILD
  • Possible to cache the ILD work

  12. Decoder Example: AMD K5 (1/2)
  [Figure: 8 bytes (b0..b7) arrive from memory; predecode logic appends 5 bits to every byte (8 bits → 13 bits, so the 8 bytes become 13 bytes); the I$ stores 8 × (8-bit inst byte + 5-bit predecode) per fill and feeds 16 × (8-bit inst byte + 5-bit predecode) to decode, which emits up to 4 ROPs.]
  • Compute ILD on fetch (at I$ fill), store the ILD results in the I$
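
  A minimal sketch of predecode-on-fill. The 5-bit format used here (start/end flags plus a 3-bit ROP count) is an assumption for illustration, not the actual K5 encoding.

      #include <stdint.h>

      /* hypothetical layout of one predecoded I$ entry */
      typedef struct {
          uint8_t byte;        /* raw instruction byte     */
          uint8_t start : 1;   /* first byte of an inst?   */
          uint8_t end   : 1;   /* last byte of an inst?    */
          uint8_t rops  : 3;   /* ROPs this inst will need */
      } predecoded_t;

      /* toy length decoder again (illustrative, not real x86) */
      static unsigned toy_length(const uint8_t *p) {
          return p[0] == 0x05 ? 5 : p[0] == 0x01 ? 2 : 1;
      }

      /* On an I$ fill, walk the 8 incoming bytes once and tag them, so
       * fetch/decode never re-runs ILD for lines that hit in the I$. */
      void predecode_fill(const uint8_t line[8], predecoded_t out[8]) {
          unsigned i = 0;
          while (i < 8) {
              unsigned len = toy_length(line + i);
              for (unsigned j = 0; j < len && i + j < 8; j++)
                  out[i + j] = (predecoded_t){ .byte  = line[i + j],
                                               .start = (j == 0),
                                               .end   = (j == len - 1),
                                               .rops  = 1 };
              i += len;
          }
      }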

  13. Decoder Example: AMD K5 (2/2)
  • Predecode makes decode easier by providing:
    • Instruction start/end locations (ILD)
    • Number of ROPs needed per inst
    • Opcode and prefix locations
  • Power/performance tradeoffs
    • Larger I$ (5 extra bits per 8-bit byte: 5/8 = 62.5% more array)
      • Longer I$ latency, more I$ power consumption
    • Removes logic from decode
      • Shorter pipeline, simpler logic
    • Cached and reused decode work → less decode power
    • Longer effective L1-I miss latency (ILD on fill)

  14. Decoder Example: Intel P-Pro
  • Only Decoder 0 interfaces with the uROM and MS
  • If an insn. in Decoder 1 or Decoder 2 requires > 1 uop:
    • do not generate output
    • shift it to Decoder 0 on the next cycle
  [Figure: 16 raw instruction bytes feed Decoder 0 (up to 4 uops, with mROM access), Decoder 1, and Decoder 2 (1 uop each); a Branch Address Calculator resteers fetch if needed.]
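
  A sketch of this 4-1-1 steering rule. The uops_needed[] values are assumed inputs from a limited pre-decode; insts needing more than 4 uops would further divert to the MS/uROM, as on slide 4.

      #define SLOTS 3

      /* One decode cycle: slot 0 (Decoder 0) can crack a multi-uop
       * inst; slots 1 and 2 handle single-uop insts only. */
      int decode_cycle(const int uops_needed[SLOTS]) {
          int issued = 0;
          for (int slot = 0; slot < SLOTS; slot++) {
              if (slot == 0 || uops_needed[slot] == 1) {
                  issued += uops_needed[slot];   /* decodes this cycle */
              } else {
                  /* Decoder 1/2 can't crack this inst: it emits nothing
                   * now and shifts toward Decoder 0 next cycle, along
                   * with everything younger than it. */
                  break;
              }
          }
          return issued;   /* uops produced this cycle */
      }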

  15. Fetch Rate is an ILP Upper Bound
  • Instruction fetch limits performance
    • To sustain an IPC of N, must sustain a fetch rate of N per cycle
    • If you consume 1500 calories per day, but burn 2000 calories per day, then you will eventually starve
    • Need to fetch N on average, not on every cycle
  • N-wide superscalar ideally fetches N insns. per cycle
  • This doesn't happen in practice due to:
    • Instruction cache organization
    • Branches
    • ... and the interaction between the two

  16. Instruction Cache Organization
  • To fetch N instructions per cycle...
    • L1-I line must be wide enough for N instructions
    • PC register selects the L1-I line
  • A fetch group is the set of insns. starting at PC
    • For an N-wide machine: [PC, PC+N-1]
  [Figure: PC indexes one cache line (tag + 4 insts); the selected line feeds the decoder.]
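
  A minimal sketch of the indexing with illustrative parameters (fixed 4-byte insts, 4 insts per line, so a line is exactly one fetch group wide):

      #include <stdint.h>

      #define INST_BYTES     4
      #define INSTS_PER_LINE 4
      #define LINE_BYTES (INST_BYTES * INSTS_PER_LINE)

      typedef struct { uint32_t line; uint32_t slot; } fetch_index_t;

      /* PC's upper bits pick the L1-I line; the remaining low bits say
       * at which slot the fetch group [PC, PC+N-1] starts. */
      fetch_index_t index_fetch(uint32_t pc) {
          fetch_index_t f = { .line = pc / LINE_BYTES,
                              .slot = (pc % LINE_BYTES) / INST_BYTES };
          return f;
      }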

  17. Fetch Misalignment (1/2)
  • If PC = xxx01001, N=4:
    • Ideal fetch group is xxx01001 through xxx01100 (inclusive)
  [Figure: PC xxx01001 selects line 010, but the fetch group starts at slot 01 and runs past the end of the line; only the line's last three slots fall inside the group this cycle.]
  • Misalignment reduces fetch width

  18. Fetch Misalignment (2/2)
  • Now takes two cycles to fetch N instructions
    • ½ fetch bandwidth!
  [Figure: cycle 1 (PC: xxx01001) fetches the three insts in slots 01-11 of line 010; cycle 2 (PC: xxx01100) fetches the remaining inst from slot 00 of line 011.]
  • Might not be ½, if the leftover slots combine with the next fetch
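
  A worked sketch of the bandwidth loss with the same geometry (4 slots per line) and the slide's inst-granular PC, where the low two bits are the slot:

      #include <stdio.h>

      #define SLOTS 4   /* insts per L1-I line, as in the figure */

      /* Insts deliverable in cycle 1: only those in PC's own line. */
      static int first_cycle_insts(unsigned pc, int n) {
          int avail = SLOTS - (int)(pc % SLOTS);
          return avail < n ? avail : n;
      }

      int main(void) {
          unsigned pc = 9;   /* the slide's xxx01001: slot 01 of line 010 */
          int n = 4, c1 = first_cycle_insts(pc, n);
          printf("cycle 1: %d insts, cycle 2: %d insts\n", c1, n - c1);
          /* prints "cycle 1: 3 insts, cycle 2: 1 insts" -- two cycles
           * to deliver a single 4-inst fetch group */
          return 0;
      }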

  19. Reducing Fetch Fragmentation (1/2)
  • Make |Fetch Group| < |L1-I Line|
  [Figure: each cache line now holds 8 insts, twice the 4-inst fetch group; PC selects a 4-inst window within the line for the decoder.]
  • Can deliver N insns. when PC is > N from the end of the line

  20. Reducing Fetch Fragmentation (2/2)
  • Needs a "rotator" to deliver insns. to the decoder in the correct order
  [Figure: PC selects a window of the 8-inst line; the rotator realigns the selected insts into an aligned fetch group before the decoder.]
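
  A minimal sketch of the rotator with the slide's geometry (8-inst line, 4-inst group); slot is PC's offset within the line:

      #include <stdint.h>

      #define LINE_INSTS 8   /* line is twice the fetch group */
      #define GROUP      4

      /* Rotate the line so the inst at PC's slot lands in group[0]. In
       * hardware this is a barrel shifter / mux network, not a loop.
       * If slot + GROUP runs past the line, the wrapped entries hold
       * the wrong insts -- the fragmentation case of slides 17-18. */
      void rotate(const uint32_t line[LINE_INSTS], unsigned slot,
                  uint32_t group[GROUP]) {
          for (unsigned i = 0; i < GROUP; i++)
              group[i] = line[(slot + i) % LINE_INSTS];
      }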
