Bulldozer: An Approach to Multithreaded Compute Performance

Bulldozer: An Approach to Multithreaded Compute Performance, by Michael Butler, Leslie Barnes, Debjit Das Sarma, and Bob Gelinas. This paper appears in: Micro, IEEE, March/April 2011 (vol. 31 no. 2), pp. 6-15. Course: Microprocessor Architecture. Speaker: 박세준.

Presentation Transcript
Bulldozer: An Approach to Multithreaded Compute Performance

by

Michael Butler, Leslie Barnes,

Debjit Das Sarma, Bob Gelinas

This paper appears in: Micro, IEEE

March/April 2011 (vol. 31 no. 2)

pp. 6-15

Microprocessor Architecture

Speaker: 박세준

Contents
  • Motivation
  • Introduction
  • Block diagram
  • Key features
  • Function block highlights
  • Bulldozer-based SoC
Motivation

AMD has been focusing on core count and highly parallel server workloads.

  • Two basic observations
    • Future SoCs will support multiple execution threads
      • The core should be the smallest possible building block
    • Cores will operate in a power-constrained environment
      • Power reduction techniques: filtering, speculation reduction, and data movement minimization

The goal: performance per watt!

Introduction

Bulldozer is a new direction in microarchitecture

  • Bulldozer is the first x86 design to share substantial hardware between multiple cores
  • Bulldozer is a hierarchical design with sharing at nearly every level
  • Bulldozer is a CPU optimized for high frequency
  • It aims to raise average performance rather than peak performance
Introduction
  • Major contributions
  • Scaling the core structures
  • Aggressive frequency goal
    • Low gate count per clock
Block diagram
  • It combines two independent cores into one module
    • Implements a shared level-2 cache
    • Improves area and power efficiency
  • The module can fetch and decode up to four x86 instructions per clock
  • Each core can service two loads per cycle
  • Shared front end
  • Decoupled predict and fetch pipelines
Block diagram
  • ALU performance: 33% decrease; FPU performance: 33% increase
  • ALU performance: 33% increase; FPU performance: 33% increase
Key features
  • 1. Multithreading microarchitecture
    • Appropriate use of replicated and shared hardware
    • The main advantage comes from sharing the instruction cache and branch prediction
    • Reinforced front end (larger ROB and BTB)
  • 2. Branch prediction decoupled from the instruction fetch pipeline
    • Enables instruction prefetch using the prediction queue
    • Instruction control unit (reorder buffer) increased to 128 entries
  • 3. Register renaming and operand delivery
    • Scheduling and operand handling are the biggest power consumers in the integer execution unit
    • PRF-based renaming microarchitecture for power efficiency
      • Eliminates data replication
  • 4. FMAC and media extensions
    • FMAC (floating-point multiply-accumulate) units deliver significant peak execution bandwidth
    • One FPU is provided per module and acts like a coprocessor
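The decoupling in feature 2 can be pictured as a predictor stage that keeps producing fetch addresses into a queue while the fetch stage is stalled. The sketch below is a toy model only: the class and function names, the queue capacity of 4, and the one-prediction-per-cycle rate are illustrative assumptions, not details taken from the slides or the paper.

```python
from collections import deque


class PredictionQueue:
    """Bounded FIFO of predicted fetch addresses (hypothetical model)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = deque()

    def full(self):
        return len(self.entries) >= self.capacity

    def push(self, addr):
        if self.full():
            return False
        self.entries.append(addr)
        return True

    def pop(self):
        return self.entries.popleft() if self.entries else None


def simulate(predicted_addrs, fetch_stall_cycles, capacity=4):
    """Run the predictor and the fetcher as independent stages.

    predicted_addrs: addresses the predictor produces, one per cycle.
    fetch_stall_cycles: finite set of cycle numbers in which fetch is
        stalled (for example, waiting on an instruction-cache miss).
    Returns (addresses fetched in order, maximum queue occupancy seen).
    """
    queue = PredictionQueue(capacity)   # capacity is an assumed value
    pending = deque(predicted_addrs)
    fetched = []
    max_occupancy = 0
    cycle = 0
    while pending or queue.entries:
        # The predictor runs ahead as long as the queue has room.
        if pending and not queue.full():
            queue.push(pending.popleft())
        max_occupancy = max(max_occupancy, len(queue.entries))
        # Fetch consumes one prediction per cycle unless it is stalled.
        if cycle not in fetch_stall_cycles:
            addr = queue.pop()
            if addr is not None:
                fetched.append(addr)
        cycle += 1
    return fetched, max_occupancy
```

While fetch is stalled, the queue fills with addresses that have not been fetched yet; those queued addresses are what a prefetcher could act on ahead of time.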
Function block highlights
  • Branch prediction
    • Multilevel BTB
  • Instruction cache
    • 64-Kbyte, two-way set-associative cache shared between both threads
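To make the instruction cache geometry concrete, the sketch below splits an address into tag, set index, and line offset for a 64-Kbyte, two-way set-associative cache. The 64-byte line size is an assumption (the slide does not state it), and the function name is hypothetical.

```python
CACHE_BYTES = 64 * 1024   # from the slide
WAYS = 2                  # from the slide
LINE_BYTES = 64           # ASSUMPTION: line size is not given on the slide

NUM_SETS = CACHE_BYTES // (WAYS * LINE_BYTES)
OFFSET_BITS = LINE_BYTES.bit_length() - 1   # log2 of a power of two
INDEX_BITS = NUM_SETS.bit_length() - 1


def split_address(addr):
    """Return (tag, set_index, line_offset) for an address."""
    offset = addr & (LINE_BYTES - 1)
    index = (addr >> OFFSET_BITS) & (NUM_SETS - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset
```

Under these assumptions there are 512 sets, so addresses 32 Kbytes apart map to the same set. With two ways, two such lines can be resident together; because the cache is shared, they may belong to different threads.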
Function block highlights
  • Decode
    • Branch fusion (Intel: macro fusion), four x86 instructions per cycle
  • Bulldozer execution pipeline
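Branch fusion can be illustrated as a decode-time pass that merges a flag-setting instruction with the conditional jump that immediately follows it, so the pair occupies one slot downstream. This is a simplified sketch: the set of fusible first instructions (`cmp`, `test`) and the mnemonic-only representation are assumptions for illustration, not the actual decoder rules.

```python
# ASSUMPTION: which instructions may fuse is chosen here for illustration.
FUSIBLE_FIRST = {"cmp", "test"}


def is_cond_jump(mnemonic):
    """Treat any j* mnemonic other than the unconditional jmp as conditional."""
    return mnemonic.startswith("j") and mnemonic != "jmp"


def fuse_branches(mnemonics):
    """Merge each adjacent (cmp/test, conditional jump) pair into one op."""
    fused = []
    i = 0
    while i < len(mnemonics):
        cur = mnemonics[i]
        nxt = mnemonics[i + 1] if i + 1 < len(mnemonics) else None
        if cur in FUSIBLE_FIRST and nxt is not None and is_cond_jump(nxt):
            fused.append(f"{cur}+{nxt}")   # one fused operation
            i += 2
        else:
            fused.append(cur)
            i += 1
    return fused
```

For example, `fuse_branches(["mov", "cmp", "jne", "add"])` yields three operations instead of four, which is how fusion raises effective decode bandwidth.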
Function block highlights
  • Integer scheduler and execution
    • Renaming by PRF (physical register file)
  • Floating point
    • The FPU is a coprocessor shared between the two integer cores
  • L2 cache
    • The two cores share the unified L2 cache
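PRF-based renaming can be sketched as a map from architectural registers to tags in a single physical register file, plus a free list. Values live in one place and only tags move through the machine, which is the sense in which data replication is avoided. The class below is a minimal illustration; its name, its methods, and the register counts in the usage are hypothetical.

```python
class PrfRenamer:
    """Toy register renamer backed by one physical register file."""

    def __init__(self, arch_regs, num_phys):
        if num_phys <= len(arch_regs):
            raise ValueError("need more physical than architectural registers")
        # Initial mapping: architectural register i lives in physical tag i.
        self.map = {reg: tag for tag, reg in enumerate(arch_regs)}
        self.free = list(range(len(arch_regs), num_phys))

    def rename(self, dest, srcs):
        """Rename one instruction; returns (dest_tag, src_tags, old_dest_tag)."""
        # Sources are looked up before the destination is remapped, so an
        # instruction like "rax = rax + rbx" reads the previous rax.
        src_tags = [self.map[s] for s in srcs]
        if not self.free:
            raise RuntimeError("no free physical register: rename must stall")
        new_tag = self.free.pop(0)
        old_tag = self.map[dest]
        self.map[dest] = new_tag
        return new_tag, src_tags, old_tag

    def retire(self, old_tag):
        """At retirement, the previous mapping of the destination is freed."""
        self.free.append(old_tag)
```

A short usage example: with two architectural registers and four physical ones, two renames exhaust the free list, and retiring the first instruction returns its old destination tag for reuse.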
Bulldozer-based SoC
  • Summary
  • Single-thread peak performance is sacrificed in exchange for higher throughput
  • For single-threaded workloads, the FPU is more important
  • Server workloads need ALU performance
  • Bulldozer can deliver a significant performance improvement within the same power budget