Block Structured Architecture for Future Processors: Feasible Paradigm

Application Domains for Fixed-Length Block Structured Architectures • ACSAC-2001, Gold Coast, January 30, 2001

Outline • Introduction • Block Structured Architecture • Methodology • Results • Conclusions

Introduction • Out-of-order architecture • dynamically schedules independent instructions • Higher ILP through • more powerful processor core • fast instruction delivery • But … this increases the hardware complexity significantly!

Hardware complexity • processor core • instruction window: O(n²) • bypass logic: long wires [Palacharla et al. 1996] • register file: many ports [Farkas et al. 1995] • fetching • fetch bandwidth • multiple branches • cache access

Solutions • processor core: decentralization • trace processor [Rotenberg et al. '97] • multiscalar architecture [Sohi et al. '95] • clusters (Alpha 21264) • fetching: bigger units of work • trace in trace processors • task in multiscalar architecture • block in block-structured ISA [Melvin and Patt '95; Hao et al. '96]

Basic idea of BSA • Fixed-Length Block Structured Architecture (BSA) addresses • the processor core problem • the fetching problem • through appropriate microarchitectural and implementation design decisions • BSA is a feasible architectural paradigm for future processors

Block Structured Architecture: overcoming the fetch problem • [Figure: a BSA-block composed of basic blocks guarded by the predicates (p1), (~p1), (p2), (~p2)] • BSA-block is the atomic unit of work • no control flow • predication • static register renaming • data-flow execution • fixed-length • Advantages: • predication: elimination of unbiased branches • intra-block communication: fewer register file ports required • fixed-length BSA-blocks: easier fetching • Disadvantages: • BSA-block not always filled • higher memory bandwidth • bigger instruction caches • BSA-block compression

Block Structured Architecture: overcoming the processor core problem • [Figure: block diagram with instruction cache, fetch unit, branch predictor, four block engines, instruction window, functional units FU1 and FU2, data cache, register file] • fixed-length BSA-block • speculative execution • fast intra-block communication • slow inter-block communication

Decentralization (1) • out-of-order architectures with higher levels of ILP: complex design • wiring delay will dominate in future technologies • scaling out-of-order architectures to higher levels of ILP for future technologies is infeasible • decentralization: small, and thus very fast, block engines communicating through longer, and thus slower, interconnects

Decentralization (2) • lower IPC, due to • slower interconnections (1 cycle latency) • poor virtual instruction window utilization caused by the coarser granularity • higher clock frequency F, due to • decentralization • performance = IPC × F • higher performance for large virtual window sizes
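The trade-off on this slide can be made concrete with a small calculation. This is only an illustrative sketch: the IPC and clock values are hypothetical placeholders, not measurements from the talk, and the function name `performance` is introduced here for illustration.

```python
def performance(ipc: float, freq_ghz: float) -> float:
    """performance = IPC x F, expressed here in instructions per nanosecond."""
    return ipc * freq_ghz

# Hypothetical numbers, chosen only to illustrate the trade-off:
# the decentralized design loses IPC but gains clock frequency.
centralized = performance(ipc=4.0, freq_ghz=1.0)
decentralized = performance(ipc=3.2, freq_ghz=1.5)

speedup = decentralized / centralized
print(f"speedup = {speedup:.2f}")
```

With these assumed values the frequency gain outweighs the IPC loss; with a smaller frequency gain the result would flip, which is why both factors have to be evaluated together.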

Statistical Modeling • [Figure: flow diagram with steps numbered 1 to 6] • benchmark trace (e.g. SPECint) → statistical profiler (extraction of distributions) → statistical profile (distributions) → synthetic trace generator → synthetic trace → trace-driven simulator → IPC • further inputs shown in the diagram: microarchitectural parameters, BSA-block size b
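As a rough sketch of the profiling step, the snippet below turns a toy trace into empirical distributions. The trace format (one `(basic_block_size, branch_taken)` tuple per executed basic block) and the names `empirical_distribution`, `toy_trace`, and `profile` are assumptions made for illustration; the talk does not specify how the profiler stores its distributions.

```python
from collections import Counter

def empirical_distribution(samples):
    """Return {value: relative frequency} for an iterable of observations."""
    counts = Counter(samples)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}

# Hypothetical trace: (basic block size, branch taken?) per executed basic block.
toy_trace = [(4, True), (7, False), (4, True), (12, True), (7, True), (4, False)]

# The "statistical profile" is simply a collection of such distributions.
profile = {
    "basic_block_size": empirical_distribution(size for size, _ in toy_trace),
    "branch_taken": empirical_distribution(taken for _, taken in toy_trace),
}
print(profile)
```

A real profile would cover more characteristics (for example instruction types and register dependencies), but each would be collected in the same counting fashion.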

Synthetic BSA-trace Generation • generate control flow • determine basic block size • add basic block to most likely execution path • until b instructions in BSA-block • generate data flow • instruction type • number of operands • age of register operands • determine actually executed control flow path • [Figure: example BSA-block drawn as a tree of basic blocks numbered 1 to 5, annotated with the probabilities 0.65, 0.35, 0.25, 0.40, 0.20, 0.15, 0.20, 0.05, 0.20, 0.20; the actually executed basic blocks are marked]
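The control-flow part of the generation procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' generator: every branch is two-way, a single probability `taken_prob` applies to all branches, and the last basic block is truncated so that the BSA-block holds exactly b instructions. All names (`generate_bsa_block`, `executed_instructions`, the node fields) are hypothetical.

```python
import random

def generate_bsa_block(b, size_dist, taken_prob, rng):
    """Grow one synthetic BSA-block of exactly b instructions.

    size_dist: {basic_block_size: probability} (assumed profile format).
    Returns a list of basic-block nodes forming a tree rooted at index 0.
    """
    sizes, weights = zip(*sorted(size_dist.items()))

    def draw_size(room):
        # Assumption: truncate a basic block that would overflow the BSA-block.
        return min(rng.choices(sizes, weights=weights)[0], room)

    nodes = [{"size": draw_size(b), "prob": 1.0, "parent": None, "taken": None}]
    used = nodes[0]["size"]
    # Candidate successors: (path probability, parent index, branch direction).
    frontier = [(taken_prob, 0, True), (1.0 - taken_prob, 0, False)]
    while used < b:
        best = max(frontier)  # extend the most likely execution path
        frontier.remove(best)
        prob, parent, taken = best
        size = draw_size(b - used)
        nodes.append({"size": size, "prob": prob, "parent": parent, "taken": taken})
        used += size
        me = len(nodes) - 1
        frontier += [(prob * taken_prob, me, True),
                     (prob * (1.0 - taken_prob), me, False)]
    return nodes

def executed_instructions(nodes, taken_prob, rng):
    """Sample one control-flow path and count the instructions on it."""
    current, useful = 0, nodes[0]["size"]
    while True:
        taken = rng.random() < taken_prob
        nxt = [i for i, n in enumerate(nodes)
               if n["parent"] == current and n["taken"] == taken]
        if not nxt:
            return useful  # control flow leaves the BSA-block
        current = nxt[0]
        useful += nodes[current]["size"]

rng = random.Random(0)
block = generate_bsa_block(16, {3: 0.2, 5: 0.5, 8: 0.3}, taken_prob=0.6, rng=rng)
useful = executed_instructions(block, taken_prob=0.6, rng=rng)
print(f"{len(block)} basic blocks, {useful} of 16 instructions executed")
```

The ratio `useful / b` corresponds to the fraction of useful instructions in a BSA-block; the data-flow step (instruction types, operand counts, register ages) is left out of this sketch.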

Benchmarks • SPECint95: integer • SPECfp95: floating-point • MediaBench: signal and multimedia processing • MPEG-4 like algorithms • measuring program characteristics through instrumentation (ATOM) on Alpha architecture

Instruction Mix • Load/store instructions • SPECint95: 40.6% • SPECfp95: 37.7% • multimedia: 29.2% • Branch instructions • SPECint95: 14.0% • SPECfp95: 3.6% • multimedia: 8.5% • Some multimedia applications have floating-point instructions

Control-intensity • Good measure: number of instructions between 2 mispredicted branches = (number of instructions between 2 branches) / (branch misprediction rate) • SPECint95: 80.1 = 7.3 / 9.1% • SPECfp95: 415.3 = 25.0 / 6.0% • multimedia: 156.9 = 14.3 / 9.1%
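The measure on this slide can be recomputed from its two inputs. The sketch below uses the three rows of numbers from the slide; the function and variable names are introduced here for illustration only.

```python
def instructions_between_mispredictions(instr_between_branches, misprediction_rate):
    """Average number of instructions between two mispredicted branches."""
    return instr_between_branches / misprediction_rate

# (instructions between 2 branches, misprediction rate, value reported on the slide)
slide_values = {
    "SPECint95": (7.3, 0.091, 80.1),
    "SPECfp95": (25.0, 0.060, 415.3),
    "multimedia": (14.3, 0.091, 156.9),
}

results = {}
for suite, (between, rate, reported) in slide_values.items():
    results[suite] = instructions_between_mispredictions(between, rate)
    print(f"{suite}: computed {results[suite]:.1f}, reported {reported}")
```

The recomputed values agree with the reported ones to within about 1%; the small differences are presumably because the slide shows rounded inputs.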

BSA-block formation: number of useful instructions • [Figure: fraction of useful instructions (y-axis, 50% to 100%) versus BSA-block size (x-axis: 16, 32, 64, 128), with curves for avg media, avg SPECint95, and avg SPECfp95]

BSA-block formation: predictability of multi-way branch • [Figure: multi-way branch predictability (y-axis, 0% to 100%) for multimedia, integer, and floating-point applications, with series for 16-, 32-, and 64-instruction blocks] • 16-instruction block: 90% in most cases • 32-instruction block: low for several integer applications • 64-instruction block: only for floating-point applications

Conclusions • Multimedia applications are less control-intensive than integer applications • due to larger basic block size under comparable branch predictability • Multimedia applications are more control-intensive than floating-point applications • due to smaller basic block size and lower branch predictability • 16 instructions per BSA-block is appropriate • larger blocks result in higher (multi-way) branch misprediction rates
