Explore the Fixed-Length Block Structured Architecture (BSA) paradigm addressing hardware complexity and fetching issues for enhanced processor core efficiency. BSA introduces basic block units with advantages in predication, communication, and instruction cache fetching. Dive into the methodology and results to discover the potential of BSA in overcoming processor core challenges and enhancing overall performance for future technologies.
Application Domains for Fixed-Length Block Structured Architectures ACSAC-2001 Gold Coast, January 30, 2001
Outline • Introduction • Block Structured Architecture • Methodology • Results • Conclusions
Introduction • Out-of-order architecture • dynamically schedules independent instructions • Higher ILP through • more powerful processor core • fast instruction delivery • But … this increases the hardware complexity significantly!
Hardware complexity processor core instruction window O(n²) bypass logic long wires [Palacharla et al. 1996] register file many ports [Farkas et al. 1995] fetching fetch bandwidth multiple branches cache access
Solutions • processor core: decentralization • trace processor [Rotenberg et al. ’97] • multiscalar architecture [Sohi et al. ’95] • clusters (Alpha 21264) • fetching: bigger units of work • trace in trace processors • task in multiscalar architecture • block in block-structured ISA [Melvin and Patt ’95; Hao et al. ’96]
Basic idea of BSA • Fixed-Length Block Structured Architecture (BSA) addresses • the processor core problem • the fetching problem • by appropriate microarchitectural and implementation design decisions • BSA is a feasible architectural paradigm for future processors
Block Structured Architecture: overcoming the fetch problem
[Figure: a BSA-block containing basic blocks guarded by the predicates (p1), (~p1), (p2), (~p2)]
BSA-block is the atomic unit of work • no control flow • predication • static register renaming • data-flow execution • fixed-length
• Advantages: • predication: elimination of unbiased branches • intra-block communication: fewer register file ports required • fixed-length BSA-blocks: easier fetching
• Disadvantages: • BSA-block not always filled • higher memory bandwidths • bigger instruction caches • BSA-block compression
Block Structured Architecture: overcoming the processor core problem
[Figure: block diagram with instruction cache, fetch unit, branch predictor, four block engines, FU1, FU2, data cache, register file]
• fixed-length BSA-block • speculative execution • fast intra-block communication • slow inter-block communication • instruction window
Decentralization (1) • out-of-order architectures with higher levels of ILP: complex design • wiring delay will dominate in future technologies • scaling out-of-order architectures to higher levels of ILP for future technologies is infeasible • decentralization: small, and thus very fast, block engines communicating through longer, and thus slower, interconnects
Decentralization (2) • lower IPC • slower interconnections (1-cycle latency) • poor virtual instruction window utilization due to higher granularity • higher clock frequency F • decentralization • performance = IPC × F • higher performance for large virtual window sizes
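The trade-off on this slide is plain arithmetic, so a small sketch may help. The function names and every number below are hypothetical and chosen only for illustration; none of them come from the presentation.

```python
def performance(ipc, frequency_hz):
    """Instructions per second, following performance = IPC x F."""
    return ipc * frequency_hz


def breakeven_frequency_gain(ipc_centralized, ipc_decentralized):
    """Clock-frequency ratio the decentralized design needs just to match
    the centralized one, given its lower IPC."""
    return ipc_centralized / ipc_decentralized


# Hypothetical design points (illustrative values only).
centralized = performance(ipc=4.0, frequency_hz=1.0e9)
decentralized = performance(ipc=3.2, frequency_hz=1.5e9)

print(f"centralized:   {centralized:.2e} instructions/s")
print(f"decentralized: {decentralized:.2e} instructions/s")
print(f"break-even frequency gain: {breakeven_frequency_gain(4.0, 3.2):.2f}x")
```

With these made-up values the decentralized design loses 20% of its IPC but still comes out ahead, because its clock gain exceeds the break-even ratio.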
Statistical Modeling
[Figure: flow diagram with six numbered steps. A statistical profiler extracts distributions from a benchmark trace (e.g. SPECint) into a statistical profile; a synthetic trace generator, given the BSA-block size b, produces a synthetic trace; a trace-driven simulator, given the microarchitectural parameters, runs the synthetic trace and reports IPC.]
Synthetic BSA-trace Generation
• generate control flow • determine basic block size • add basic block to most likely execution path • until b instructions in BSA-block
• generate data flow • instruction type • number of operands • age of register operands
• determine actually executed control flow path
[Figure: a BSA-block built from basic blocks 1 to 5, annotated with probabilities 0.65, 0.35; 0.25, 0.40, 0.20, 0.15; 0.20, 0.05, 0.20, 0.20; the actually executed basic blocks are marked]
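The control-flow step above can be sketched as a greedy loop over a priority queue. This is a minimal sketch under stated assumptions, not the authors' generator: the geometric basic-block size distribution, the uniform branch bias, the `mean_bb_size` default and the rule of stopping when the next basic block does not fit are all hypothetical stand-ins for the distributions a real statistical profile would supply.

```python
import heapq
import random


def form_bsa_block(b, rng, mean_bb_size=7.0):
    """Grow one synthetic BSA-block of at most b instructions.

    Returns a list of (block_id, parent_id, size, path_probability) tuples.
    """
    def sample_size():
        # Hypothetical size distribution: geometric with the given mean, >= 1.
        size = 1
        while rng.random() > 1.0 / mean_bb_size:
            size += 1
        return size

    root_size = min(sample_size(), b)  # assumption: clamp an oversized first block
    blocks = [(0, None, root_size, 1.0)]
    used = root_size

    # Candidate successors ordered by path probability (max-heap via negation).
    frontier = []

    def push_successors(parent_id, parent_prob):
        p_taken = rng.random()  # hypothetical branch bias, uniform in [0, 1)
        for p_edge in (p_taken, 1.0 - p_taken):
            heapq.heappush(frontier, (-(parent_prob * p_edge), parent_id))

    push_successors(0, 1.0)
    while frontier:
        # Extend the most likely execution path first.
        neg_prob, parent_id = heapq.heappop(frontier)
        size = sample_size()
        if used + size > b:
            break  # assumption: a block that does not fit leaves the BSA-block unfilled
        block_id = len(blocks)
        blocks.append((block_id, parent_id, size, -neg_prob))
        used += size
        push_successors(block_id, -neg_prob)
    return blocks


for block in form_bsa_block(b=16, rng=random.Random(0)):
    print(block)
```

Stopping early when a basic block does not fit mirrors the "BSA-block not always filled" disadvantage listed earlier; the data-flow step (instruction type, operand count, operand age) is left out of this sketch.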
Benchmarks • SPECint95: integer • SPECfp95: floating-point • MediaBench: signal and multimedia processing • MPEG-4 like algorithms • measuring program characteristics through instrumentation (ATOM) on Alpha architecture
Instruction Mix • Load/store instructions • SPECint95 40.6% • SPECfp95 37.7% • multimedia 29.2% • Branch instructions • SPECint95 14.0% • SPECfp95 3.6% • multimedia 8.5% • Some multimedia applications have floating-point instructions
Control intensity
• Good measure: “number of instructions between 2 mispredicted branches” = (number of instructions between 2 branches) / (branch misprediction rate)

              instructions between       instructions between   branch misprediction
              2 mispredicted branches    2 branches             rate
SPECint95     80.1                       7.3                    9.1%
SPECfp95      415.3                      25.0                   6.0%
multimedia    156.9                      14.3                   9.1%
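The measure can be recomputed from the other two figures on the slide. The sketch below uses the slide's numbers; the function and variable names are illustrative, and reading the flattened figures as "branch distance, then misprediction rate" is an assumption.

```python
def insns_between_mispredictions(insns_between_branches, misprediction_rate):
    """Average number of instructions between two mispredicted branches."""
    return insns_between_branches / misprediction_rate


# (instructions between 2 branches, misprediction rate, value reported on the slide)
measurements = {
    "SPECint95": (7.3, 0.091, 80.1),
    "SPECfp95": (25.0, 0.060, 415.3),
    "multimedia": (14.3, 0.091, 156.9),
}

for suite, (distance, rate, reported) in measurements.items():
    recomputed = insns_between_mispredictions(distance, rate)
    print(f"{suite}: recomputed {recomputed:.1f}, reported {reported}")
```

The recomputed values land within about 1% of the reported ones; the small gaps are what one would expect if the slide's inputs were rounded.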
BSA-block formation: number of useful instructions
[Figure: fraction of useful instructions (50% to 100%) versus BSA-block size (16, 32, 64, 128) for avg media, avg SPECint95 and avg SPECfp95]
BSA-block formation: predictability of multi-way branch
[Figure: multi-way branch predictability (0% to 100%) for multimedia, integer and floating-point benchmarks, with 16-, 32- and 64-instruction blocks]
• 16-instruction block: 90% in most cases • 32-instruction block: low for several integer applications • 64-instruction block: only for floating-point applications
Conclusions • Multimedia applications are less control-intensive than integer applications • due to larger basic block size under comparable branch predictability • Multimedia applications are more control-intensive than floating-point applications • due to smaller basic block size and lower branch predictability • 16 instructions per BSA-block is appropriate • larger blocks result in higher (multi-way) branch misprediction rates