Bulldozer: An Approach to Multithreaded Compute Performance

by Michael Butler, Leslie Barnes, Debjit Das Sarma, Bob Gelinas

This paper appears in: IEEE Micro, March/April 2011 (vol. 31, no. 2), pp. 6-15

Course: Microprocessor Architecture (마이크로 프로세서 구조)

Speaker: 박세준




Contents

  • Motivation

  • Introduction

  • Block diagram

  • Key features

  • Function block highlights

  • Bulldozer-based SoC


Motivation

AMD has been focusing on core count and highly parallel server workloads.

  • Two basic observations

    • Future SoCs will need to support multiple execution threads

      • The core module should be the smallest possible building block

    • Cores will operate in power-constrained environments

      • Power reduction techniques: filtering, speculation reduction, data movement minimization

      • The goal is performance per watt


Introduction

Bulldozer is a new direction in microarchitecture

  • Bulldozer is the first x86 design to share substantial hardware between multiple cores

  • Bulldozer is a hierarchical design with sharing at nearly every level

  • Bulldozer is a CPU optimized for high frequency

  • The design aims to raise average performance rather than peak performance


Introduction (continued)

  • Major contributions

    • Scaling the core structures

    • Aggressive frequency goal

      • Low gates per clock


Block diagram

  • A module combines two independent cores

    • The two cores share a level 2 cache

    • Sharing improves area and power efficiency

  • The module can fetch and decode up to four x86 instructions per clock

  • Each core can service two loads per cycle

  • Shared front end

  • Decoupled prediction and fetch pipelines
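
The decoupling of prediction from fetch can be pictured as two stages joined by a bounded queue. The following is only a toy sketch under stated assumptions: the names `PredictionQueue` and `simulate`, the queue capacity, and the one-address-per-cycle timing are all hypothetical and are not taken from the paper. It shows the predictor continuing to queue addresses while fetch is stalled, so that the queued addresses become available as prefetch candidates.

```python
from collections import deque


class PredictionQueue:
    """Bounded FIFO of predicted fetch addresses (hypothetical model)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = deque()

    def push(self, addr):
        # Predictor side: returns False when the queue is full (predictor stalls).
        if len(self.entries) >= self.capacity:
            return False
        self.entries.append(addr)
        return True

    def pop(self):
        # Fetch side: returns the oldest predicted address, or None if empty.
        return self.entries.popleft() if self.entries else None

    def prefetch_candidates(self):
        # Addresses queued behind the head could be prefetched while fetch is busy.
        return list(self.entries)[1:]


def simulate(predicted_addrs, fetch_stall_cycles, capacity=4):
    """Run predictor and fetcher as independent stages.

    fetch_stall_cycles is a finite set of cycle numbers in which fetch
    cannot consume an address (for example, an instruction cache miss).
    Returns (addresses fetched in order, addresses seen as prefetch candidates).
    """
    queue = PredictionQueue(capacity)
    pending = deque(predicted_addrs)
    fetched, prefetched = [], set()
    cycle = 0
    while pending or queue.entries:
        # The predictor runs ahead whether or not fetch is stalled.
        if pending and queue.push(pending[0]):
            pending.popleft()
        if cycle in fetch_stall_cycles:
            prefetched.update(queue.prefetch_candidates())
        else:
            addr = queue.pop()
            if addr is not None:
                fetched.append(addr)
        cycle += 1
    return fetched, prefetched
```

With a three-cycle fetch stall, `simulate([0x100, 0x140, 0x180, 0x1C0], {0, 1, 2})` still fetches the addresses in program order, and the addresses queued during the stall are reported as prefetch candidates.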


Block diagram (continued)

  • ALU performance: 33% decrease; FPU performance: 33% increase

  • ALU performance: 33% increase; FPU performance: 33% increase


Key features

  • 1. Multithreading microarchitecture

    • Appropriate use of replicated and shared hardware

    • The main advantage comes from sharing the instruction cache and branch prediction

    • Reinforced front end (larger ROB and BTB)

  • 2. Branch prediction decoupled from the instruction fetch pipeline

    • Enables instruction prefetch using the prediction queue

    • The instruction control unit's reorder buffer is increased to 128 entries

  • 3. Register renaming and operand delivery

    • Scheduling and operand handling are the biggest power consumers in the integer execution unit

    • PRF-based renaming microarchitecture for power efficiency

      • Eliminates data replication

  • 4. FMAC and media extensions

    • FMAC (floating-point multiply-accumulate) units deliver significant peak execution bandwidth

    • One FPU is provided per module and acts like a coprocessor
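
The numerical effect of a multiply-accumulate unit can be illustrated in software. The sketch below is not how the hardware is built; it assumes the fused form, in which `a*b + c` is rounded only once, and emulates that with exact rational arithmetic. The function names `fused_multiply_add` and `unfused_multiply_add` are hypothetical.

```python
from fractions import Fraction


def fused_multiply_add(a, b, c):
    """Emulate a*b + c with a single rounding at the end.

    Fraction(float) is exact, so the product and sum are computed without
    error; float() then rounds the exact result to the nearest double once.
    """
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)


def unfused_multiply_add(a, b, c):
    """Separate multiply and add: the product is rounded before the add."""
    return a * b + c


# Inputs chosen so that the exact product 1 - 2**-60 rounds to 1.0.
a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30
c = -1.0

fused = fused_multiply_add(a, b, c)      # keeps the small residual
unfused = unfused_multiply_add(a, b, c)  # residual lost to the first rounding
```

Here the separate multiply and add return `0.0`, while the single-rounding version returns `-2**-60`, the exact result.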


Function block highlights

  • Branch prediction

    • Multilevel BTB

  • Instruction cache

    • 64 Kbyte, two-way set-associative

    • Shared between both threads
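
The instruction cache parameters above determine how a fetch address splits into tag, set index, and byte offset. The sketch below works through that arithmetic; the 64-byte line size is an assumption made for illustration (the slides do not state it), and `cache_geometry` and `split_address` are hypothetical helper names. All sizes are assumed to be powers of two.

```python
def cache_geometry(size_bytes, ways, line_bytes):
    """Return (number of sets, offset bits, index bits) for a set-associative cache."""
    sets = size_bytes // (ways * line_bytes)
    offset_bits = line_bytes.bit_length() - 1  # log2 for a power of two
    index_bits = sets.bit_length() - 1
    return sets, offset_bits, index_bits


def split_address(addr, offset_bits, index_bits):
    """Split an address into (tag, set index, byte offset within the line)."""
    offset = addr & ((1 << offset_bits) - 1)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset


# 64 Kbyte, two-way set-associative, assumed 64-byte lines.
sets, offset_bits, index_bits = cache_geometry(64 * 1024, 2, 64)
tag, index, offset = split_address(0x12345, offset_bits, index_bits)
```

Under the assumed line size this gives 512 sets, 6 offset bits, and 9 index bits, so two addresses that differ only above bit 14 compete for the same two-way set.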


Function block highlights (continued)

  • Decode

    • Branch fusion (Intel calls this macro fusion)

    • Four x86 instructions per cycle

  • Bulldozer execution pipeline
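
Branch fusion in the decode stage can be pictured as a pass that merges a compare with the conditional jump that follows it, so the pair occupies one slot instead of two. The sketch below is a simplified illustration, not the decoder's actual rules: it assumes that only `cmp` or `test` immediately followed by a conditional jump is fused, and the tuple encoding of instructions and the name `fuse_branches` are hypothetical.

```python
COMPARE_OPS = {"cmp", "test"}  # assumed set of fusible compare instructions


def is_conditional_jump(mnemonic):
    # Treat every j* mnemonic except the unconditional jmp as conditional.
    return mnemonic.startswith("j") and mnemonic != "jmp"


def fuse_branches(instructions):
    """Merge each compare + conditional-jump pair into one fused operation.

    Each instruction is a tuple: (mnemonic, operand, ...).
    """
    fused = []
    i = 0
    while i < len(instructions):
        op = instructions[i]
        nxt = instructions[i + 1] if i + 1 < len(instructions) else None
        if op[0] in COMPARE_OPS and nxt is not None and is_conditional_jump(nxt[0]):
            # One fused op carries the compare operands and the branch target.
            fused.append((op[0] + "+" + nxt[0],) + op[1:] + nxt[1:])
            i += 2
        else:
            fused.append(op)
            i += 1
    return fused


program = [
    ("mov", "eax", "1"),
    ("cmp", "eax", "ebx"),
    ("jne", "loop"),
    ("add", "eax", "2"),
    ("test", "ecx", "ecx"),
    ("jmp", "done"),
]
decoded = fuse_branches(program)
```

In this example the `cmp`/`jne` pair becomes a single operation, while `test` followed by an unconditional `jmp` is left as two operations.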


Function block highlights (continued)

  • Integer scheduler and execution

    • Renaming by PRF (physical register file)

  • Floating point

    • The FPU is a coprocessor shared between the two integer cores

  • L2 cache

    • The two cores share the unified L2 cache
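
The PRF-based renaming listed under the integer scheduler can be sketched as a map table plus a free list: each result is written once to a physical register, and consumers carry only the register tag rather than a copy of the data. This is a generic textbook-style model under stated assumptions, not the Bulldozer implementation; the class name `PrfRenamer`, its methods, and the register counts are hypothetical.

```python
from collections import deque


class PrfRenamer:
    """Minimal physical-register-file renamer (illustrative model only)."""

    def __init__(self, num_arch_regs, num_phys_regs):
        # Initially architectural register i maps to physical register i.
        self.map_table = list(range(num_arch_regs))
        self.free_list = deque(range(num_arch_regs, num_phys_regs))

    def rename(self, dest, srcs):
        """Rename one instruction.

        Returns (new destination tag, source tags, previous destination tag).
        Sources are read from the map table before the destination is
        remapped, so an instruction that reads and writes the same register
        sees the older value.
        """
        src_tags = [self.map_table[s] for s in srcs]
        if not self.free_list:
            raise RuntimeError("no free physical register; rename must stall")
        new_tag = self.free_list.popleft()
        old_tag = self.map_table[dest]
        self.map_table[dest] = new_tag
        return new_tag, src_tags, old_tag

    def retire(self, old_tag):
        # When the instruction retires, the previous mapping can be reused.
        self.free_list.append(old_tag)


renamer = PrfRenamer(num_arch_regs=4, num_phys_regs=8)
first = renamer.rename(dest=1, srcs=[1, 2])  # r1 = r1 op r2
second = renamer.rename(dest=1, srcs=[1])    # r1 = op r1 (reads the new r1)
renamer.retire(first[2])                     # free the original r1 mapping
```

Only tags move through the scheduler in this model; the values stay in one place in the register file, which is the sense in which the scheme avoids data replication.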


Bulldozer-based SoC

  • Summary

    • Peak single-thread performance is sacrificed to increase throughput

    • For single-threaded workloads, the FPU is more important

    • Server workloads need ALU performance

    • Bulldozer can deliver a significant performance improvement within the same power budget

