    Efficient Execution of Single-thread Programs across Multiple Cores

    Behnam Robatmili

    Supervisor: Doug Burger

    UT Austin

    July 21, 2011

    Need for Efficiency and Flexibility
    • Single-thread efficiency limits scalability with multicore
      • Amdahl's law (see the formula after this list)
      • The power wall limits frequency scaling and efficiency
        • Parallel efficiency is only possible if each thread is efficient at each execution point
        • Need more efficient methods to span the performance/energy spectrum
    • Heterogeneous multicores and DVFS offer efficiency and flexibility
      • But bring various ISAs and design overheads
      • Not as flexible as we want
      • Need more innovative solutions!
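    As a reminder of the bound referenced above, Amdahl's law in its standard form (not specific to this talk) gives the speedup of a program whose parallelizable fraction is p when run on n cores:

        \mathrm{Speedup}(n) = \frac{1}{(1 - p) + p/n}

    Even with p = 0.95 the speedup is capped at 1/(1-p) = 20x as n grows, which is why per-thread efficiency still matters.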

    [Figure: AMD Llano (Fusion) and Intel Sandy Bridge chips]

    One Alternative: Dynamic (Composable) Multicore
    • For each thread, multiple simple cores
      • Share resources and form a more powerful core
      • Span a wide range of energy/performance operating points
    • Can potentially achieve high performance on a low energy budget, or operate in a low-power regime

    [Figure: several simple cores, each with an L1 cache, register file (RF), and branch predictor (BP), connected by inter-core control and data communication]

    Handling Distributed Dependences
    • CoreFusion [Micro07] and WiDGET [ISCA11]
      • Dynamically distribute execution across multiple cores
      • Need power-hungry central units to maintain control sequencing and register renaming across distributed instructions
    • With ISA support, the compiler can reduce these overheads (EDGE)

    [Figure: the same multicore organization as above, stressing inter-core control and data communication]

    EDGE ISAs

    [Figure: the same loop in a RISC ISA and an EDGE ISA. The RISC code (add, mul, ld, sub, tlt, jlt L2, fadd, fmul, sd, jump L1) communicates through the register file; the EDGE code groups instructions into atomic units whose instructions (ld, muli, add, sd, br) send results directly to target instructions, touching the register file (R0, R1) only at block boundaries]
    • Block atomic execution (predicated blocks)
      • Instruction groups fetch, execute, and commit atomically
    • Direct instruction communication (Dataflow)
      • Explicitly encode dataflow graph by specifying targets
    • Enables efficient execution and low-overhead distribution
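    To make the direct-target encoding concrete, here is a minimal Python sketch of block-atomic dataflow execution (my illustration, not the actual TRIPS/TFlex encoding or mnemonics): each instruction names the instructions it feeds instead of a destination register, and registers are touched only at block boundaries.

        from collections import defaultdict

        # Hypothetical block computing R1 = R0*5 + R0.
        # (id, opcode, targets): targets name consumer instructions, not registers.
        BLOCK = [
            ("i0", "read_r0",  ["i1", "i2"]),   # block input: register read
            ("i1", "muli5",    ["i2"]),
            ("i2", "add",      ["i3"]),
            ("i3", "write_r1", []),             # block output: register write
        ]
        NEEDED = {"read_r0": 1, "muli5": 1, "add": 2, "write_r1": 1}
        OPS = {
            "read_r0":  lambda a: a[0],
            "muli5":    lambda a: a[0] * 5,
            "add":      lambda a: a[0] + a[1],
            "write_r1": lambda a: a[0],
        }

        def execute_block(r0):
            inbox = defaultdict(list)
            inbox["i0"].append(r0)
            pending = {i: (op, tgts) for i, op, tgts in BLOCK}
            r1 = None
            while pending:
                # Dataflow firing: issue any instruction whose operands have arrived.
                ready = [i for i in pending if len(inbox[i]) == NEEDED[pending[i][0]]]
                for i in ready:
                    op, tgts = pending.pop(i)
                    val = OPS[op](inbox[i])
                    for t in tgts:              # direct operand delivery, no renaming
                        inbox[t].append(val)
                    if op == "write_r1":
                        r1 = val                # committed atomically with the block
            return r1

        assert execute_block(3) == 18           # 3*5 + 3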
    Ultimate Goal

    [Figure: a grid of many identical thin cores, each with a register file (RF), L1 cache, ALU, and branch predictor (BP), connected by an on-chip mesh]

    • Grid of many homogeneous, thin, low-power, high-performance cores connected via an on-chip mesh network
    • Single-thread efficiency (energy-delay product): logical cores are composed of multiple physical cores in a scalable, low-overhead way
    • Multithread efficiency: logical cores can be composed and decomposed at runtime according to runtime policies

    [Figure: threads T0-T7 running on logical cores composed from groups of physical cores (c) and L2 banks; the mapping changes at runtime (Compose / Recompose)]
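    For reference, the single-thread efficiency metric named above, the energy-delay product, and the delay-squared variant used later in this deck are (standard definitions, not specific to this talk):

        \mathrm{EDP} = E \cdot t, \qquad \mathrm{ED^2P} = E \cdot t^2

    where E is the energy consumed and t is the execution time.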

    Thesis Statement
    • Composable processors can potentially perform well in different power/energy regimes
    • Design a power-efficient, scalable dynamic multicore called T3 that spans a wide energy/performance spectrum
      • By evaluating the prior EDGE composable design (TFlex)
      • By re-inventing EDGE architectures to address inefficiencies and improve execution efficiency in different power/performance regimes
    Outline
    • Motivation
    • Background on TFlex and bottleneck analysis
    • Re-inventing architecture for power efficient EDGE composable execution
      • Mapping blocks across simple cores
      • Optimizing cross-core register communication
      • Reducing fetch stall bottleneck
      • Optimizing prediction and predication
      • Optimizing single-core operand delivery
    • Evaluation and conclusions
    Baseline TFlex Composable System

    [Figure: TFlex array of cores; each core contains a register-file bank (R), data-cache bank (D), branch predictor (BP), and ALU]

    • TFlex is an EDGE composable processor
    • N cores can be merged
        • They share registers, branch tables, caches…
      • And run N blocks in parallel (one non-speculative)
    • Dataflow within blocks; shared distributed registers across blocks

    TFlex: 2-wide issue execution per core

    T3 Composable Processor

    [Figure: one T3 core: a TFlex core extended with a block control & reissue unit, register bypassing, an 8-Kbit block/predicate predictor, broadcast/token select logic, a block mapping unit, and speculative predicate support]

    • Systematic bottleneck analysis based on critical-path analysis [HPCA11]
    • For each step, redesign architecture/ISA to reduce the most dominant execution bottleneck
      • Primarily aiming for performance
      • Using EDGE semantics to save energy

    T3: 2-wide issue execution per core

    Systematic Bottleneck Analysis and Reduction
    • Analyzing complex distributed systems is complicated
    • Our methodology
      • Use system-level critical-path analysis to detect the top bottleneck
      • At the component level, detect the scenario causing that bottleneck
      • Design the right optimization mechanism based on the detected scenario, and repeat
    • Bottlenecks and mechanisms are presented in the order they were detected

    System-level analysis → detect the bottleneck component → component-level analysis → detect the scenario causing the bottleneck → choose and apply the right optimization mechanism → repeat
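    As an illustration of the system-level step, here is a minimal critical-path sketch in Python (my simplification; the real T3 analysis follows the critical-path methodology of [HPCA11] over full simulator traces). It finds the longest path through a small, hypothetical DAG of microarchitectural events:

        # Longest (critical) path through a dependence DAG of events.
        # Hypothetical events: (name, latency_in_cycles, predecessors),
        # listed in topological order.
        EVENTS = [
            ("fetch_B0",  4, []),
            ("exec_B0",   9, ["fetch_B0"]),
            ("regfwd_B0", 6, ["exec_B0"]),      # cross-core register forwarding
            ("fetch_B1",  4, ["fetch_B0"]),
            ("exec_B1",   7, ["fetch_B1", "regfwd_B0"]),
            ("commit_B1", 2, ["exec_B1"]),
        ]

        def critical_path(events):
            finish, pred = {}, {}
            for name, lat, deps in events:
                start = max((finish[d] for d in deps), default=0)
                finish[name] = start + lat
                pred[name] = max(deps, key=lambda d: finish[d]) if deps else None
            node = max(finish, key=finish.get)   # last event to finish
            path = []
            while node is not None:              # walk back along latest-arriving deps
                path.append(node)
                node = pred[node]
            return finish[path[0]], path[::-1]

        total, path = critical_path(EVENTS)
        print(total, path)  # 28 ['fetch_B0', 'exec_B0', 'regfwd_B0', 'exec_B1', 'commit_B1']
        # Register forwarding contributes 6 of 28 cycles: the top bottleneck here.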

    Outline
    • Motivation
    • Background
    • Re-inventing architecture for power efficient EDGE composable execution
      • Mapping blocks across simple cores
      • Optimizing cross-core register communication
      • Reducing fetch stall bottleneck
      • Optimizing prediction and predication
      • Optimizing single-core operand delivery
    • Full T3 system evaluation and conclusions
    Reducing Fine-Grain Dataflow Communication
    • Flat mapping (original)
      • Each core runs a portion of each running block
      • Intra-block dataflow communication is a bottleneck [LCPC08, PACT08]
    • Deep mapping [MICRO08]
      • Maps each block to one core
      • Halves cross-core communication by limiting dataflow to single cores
      • The hardware dynamically selects the core for each block (see the sketch after the figure)

    [Figure: flat mapping vs. deep mapping]
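    A minimal sketch of the dynamic selection step (my illustration; the actual deep-mapping policy is the hardware mechanism of [MICRO08]): assign each fetched block, whole, to the core in the composed group with the most free instruction-window slots.

        # Deep mapping: every instruction of a block lands on one core,
        # so intra-block dataflow never crosses the network.
        def pick_core(free_slots, block_size=128):
            # free_slots: free instruction-window slots per core in the logical processor
            core = max(range(len(free_slots)), key=lambda c: free_slots[c])
            if free_slots[core] < block_size:
                return None                  # no room: stall until a block commits
            free_slots[core] -= block_size
            return core

        slots = [128, 96, 128, 64]
        print(pick_core(slots))              # 0: block mapped entirely onto core 0
        print(slots)                         # [0, 96, 128, 64]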

    Reducing Coarse-Grain Register Communication
    • Cross-core register communication between blocks becomes the bottleneck
      • Distributed register-forwarding units resolve register dependences
    • Selective register bypassing: late (critical) values go directly from producer to consumer core via very low-overhead bypassing, while the rest use register forwarding [HPCA11]

    [Figure: block B0 sends register R1 to block B1 either indirectly, via the home core of R1 (original register-value forwarding), or directly, via selective register bypassing for critical register values]
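    A minimal sketch of the routing decision (my illustration; criticality prediction and the bypass mechanism itself are those of [HPCA11]), assuming registers are interleaved across cores by number:

        NUM_CORES = 16

        def route_register(reg, producer_core, consumer_core, predicted_critical):
            """Return the sequence of hops a register value takes between blocks."""
            if predicted_critical:
                # Direct bypass: producer core sends straight to consumer core.
                return [(producer_core, consumer_core)]
            home = reg % NUM_CORES           # home core of this register
            # Default path: forward through the register's home core.
            return [(producer_core, home), (home, consumer_core)]

        print(route_register(1, producer_core=0, consumer_core=3,
                             predicted_critical=True))    # [(0, 3)]
        print(route_register(1, producer_core=0, consumer_core=3,
                             predicted_critical=False))   # [(0, 1), (1, 3)]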

    Reducing Fetch Criticality
    • Block fetches following flushes are critical
    • There is no control flow within a block, so blocks can be reissued if they are still in the window
    • Reissue critical instructions after pipeline flushes to reduce the misspeculation penalty [HPCA11] (see the sketch below)
    • Saves energy and delay by reducing fetches and decodes by about 50%
    Outline
    • Motivation
    • Background
    • Re-inventing architecture for power efficient EDGE composable execution
      • Mapping blocks across simple cores
      • Optimizing cross-core register communication
      • Reducing fetch stall bottleneck
      • Optimizing prediction and predication
      • Optimizing single-core operand delivery
    • Full T3 system evaluation and conclusions
    Next Block Prediction
    • To scale performance, keep multiple speculative blocks in flight
    • Coarse-grain branch prediction
    • When a block is fetched, the block following it is predicted

    [Figure: block control-flow graph B0-B6; predicted block path: B0, B1, B4, B5, B6]

    EDGE Speculation & Predication Overheads
    • Intra-block control points are converted to predicates
      • Multi-exit next-block prediction accuracy suffers (e.g., Exits 1..3)
        • The branch history does not include predicates (e.g., i1 and i3)
        • The TFlex predictor predicts the exit-ID bits of the block
      • Predicates are executed, not predicted (e.g., ST waits for R1)

    [Figure: block B1 with predicates i1 and i3 (TZ tests), a SUBI and an ST, and three exits taken on the true or false side of a predicate: Exit 1 (BR B1), Exit 2 (BR B2), Exit 3 (BR B3)]

    Iterative Path Prediction (IPP)
    • Solution [submitted to MICRO11]: predict the predicate path within each block
      • Use it to better predict the exit and to speculate on predicates
      • Example: for the i1/i3 path, “11” → take Exit 2 and skip all instructions; “00” → take Exit 1 and execute only ST

    [Figure: the same block as on the previous slide, with the predicted predicate path highlighted]

    IPP Advantages
    • Accurate next block prediction
    • Speculative execution of predicates

    [Figure: IPP structure; the current block address feeds predicate prediction and target prediction, producing the predicted branches in the block and the next block target; the predicted dataflow predicate path in the block is executed speculatively]

    IPP Predicate Predictor Component
    • Pipelined OGEHL [Seznec JILP04] predictor
      • Only one hashing stage for all instructions in the block
      • High accuracy using hazard elimination

    [Figure: pipelined OGEHL predicate predictor; a 200-bit GHR and the 40-bit block PC are hashed (H1-H4) into 7-bit indexes for tables T0-T4 of 4-bit counters, selected by history lengths L(0)-L(4); the pipeline stages are initial index compute, table access, and prediction sum, producing a 1-bit prediction per predicate with speculative update along the predicted path]
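    A minimal GEHL-style sketch in Python (my simplification of the OGEHL scheme of [Seznec JILP04]; the hashes, table sizes, and history lengths here are illustrative, not the tuned T3 parameters):

        HIST_LENS = [0, 4, 8, 16, 32]              # L(0)..L(4), illustrative
        TABLE_ENTRIES = 1 << 7                     # 7-bit indexes
        tables = [[0] * TABLE_ENTRIES for _ in HIST_LENS]

        def index(pc, ghr, length):
            h = pc                                  # toy hash standing in for H1..H4
            for bit in ghr[-length:] if length else []:
                h = ((h << 1) ^ bit) & 0xFFFFF
            return h % TABLE_ENTRIES

        def predict(pc, ghr):
            s = sum(t[index(pc, ghr, l)] for t, l in zip(tables, HIST_LENS))
            return (1 if s >= 0 else 0), s

        def update(pc, ghr, taken, threshold=5):
            pred, s = predict(pc, ghr)
            if pred != taken or abs(s) <= threshold:
                for t, l in zip(tables, HIST_LENS):
                    i = index(pc, ghr, l)
                    delta = 1 if taken else -1
                    t[i] = max(-8, min(7, t[i] + delta))   # 4-bit signed counters

        ghr = []
        for outcome in [1, 1, 0] * 100:            # a repeating predicate pattern
            update(0x40, ghr, outcome)
            ghr = (ghr + [outcome])[-200:]         # 200-bit global history
        print(predict(0x40, ghr)[0])               # prediction from the trained tables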

    Tuning IPP
    • Design parameter: # of predicted predicates/block
      • Accuracy of next block prediction
      • Accuracy of predicate prediction
      • Speedup
    • Optimum point: Predicting 4 predicates per block
      • 14% improved performance (16 merged cores)
      • 11% from predicate prediction + 3% from better next block prediction
    Outline
    • Motivation and Background
    • Rethinking compiler and hardware to efficiently exploit thin EDGE cores
      • Distributing computation across simple cores
      • Optimizing cross-core communication
      • Reducing fetch stall bottleneck
      • Optimizing prediction and predication
      • Optimizing single-core operand delivery
    • Full T3 system evaluation and conclusions
    EDGE Dataflow High-Fanout Issue
    • Using EDGE dataflow, each instruction can encode up to 2 targets
      • Efficient for low-fanout operands
      • The compiler inserts trees of move instructions for high-fanout operands (20% of all instructions!)
    • Out-of-order machines instead use dynamically generated and matched broadcast tags over bypass networks
      • High power consumption
      • Not efficient for low-fanout operands

    [Figure: dataflow graph in which register R1 feeds a tree of mov instructions fanning out to MULI, ADD, MUL, ADD, ADD, ST, and BR B2]

    Architecturally Exposed Operand Broadcast (EOB)
    • Joint work with Dong Li (lead author) [PESPMA09]
    • Low-fanout operands use dataflow targets
    • High-fanout operands use light-weight exposed operand broadcasts
      • Simple microarchitectural support
      • Source and destination EOBs (tags) are explicitly assigned to instructions
      • Assigned statically, resolved dynamically
      • Most moves are eliminated and 5% fewer blocks execute (10% less energy)

    [Figure: the same dataflow graph with the mov tree removed; high-fanout values carry EOB tags 1 and 2 (exposed operand broadcast), while low-fanout edges keep direct dataflow targets]

    Selecting which instructions receive the limited EOBs is an interesting compiler problem (a toy sketch follows).
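    A toy sketch of one greedy policy (my illustration; the real selection, as noted above, is an open compiler problem):

        # Give the few architecturally visible broadcast tags to the
        # highest-fanout producers; everyone else keeps 2-target dataflow
        # encoding (with move trees where needed).
        def assign_eobs(fanout, num_eobs=8, max_targets=2):
            # fanout: dict producer_instruction -> number of consumers
            eob_of = {}
            for inst in sorted(fanout, key=fanout.get, reverse=True):
                if len(eob_of) == num_eobs:
                    break
                if fanout[inst] > max_targets:   # only high fanout benefits
                    eob_of[inst] = len(eob_of) + 1
            return eob_of

        fanout = {"ld_a": 7, "muli": 2, "add1": 5, "sub": 1}
        print(assign_eobs(fanout))               # {'ld_a': 1, 'add1': 2}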

    Summary of Contributions

    [Figure: taxonomy of contributions, with the closest related work in brackets]

    Speculation (front end):
      • Non-critical: Next Block Prediction [trace prediction]; Predicate Path Prediction [predicate prediction for out-of-order]
      • Critical: Block Reissue [trace cache, inst. reissue]

    Communication (back end):
      • Multi-core, non-critical: Cross-core Register Forwarding Units [distributed memory]
      • Multi-core, critical: Direct Register Bypassing [TLS, LSQ bypassing]; Distributing execution using Block Mapping
      • Single-core, low fanout: Dataflow [Forwardflow, accelerators]
      • Single-core, high fanout: Exposed Broadcasts [Forwardflow, hybrid wakeup]

    Outline
    • Motivation
    • Background
    • Re-inventing architecture for power efficient EDGE composable execution
      • Mapping blocks across simple cores
      • Optimizing cross-core register communication
      • Reducing fetch stall bottleneck
      • Optimizing prediction and predication
      • Optimizing single-core operand delivery
    • Full T3 system evaluation and conclusions
    Simulation Setup
    • Accurate delay and power comparison against TRIPS and TFlex
      • Cycle/power simulator validated against TRIPS hardware
      • Power models validated against real hardware and RTL [IEEE Computer, in revision]
      • Memory models use CACTI
      • Technology: 45 nm, Vdd: 1.1 V, frequency: 2.4 GHz
    • Comparison with Core 2 and Atom platforms in different DVFS regions
      • Power and performance measured on real hardware, as reported by H. Esmaeilzadeh [ASPLOS11]
      • L1 + core power estimated using McPAT [ISCA10]
    SPEC INT Performance/Energy Results

    [Plot: SPEC INT speedup and normalized core + L1 energy over a single dual-issue core vs. number of composed cores, for TFlex and T3 against TRIPS, with Pollack's rule for reference]

    TFlex-8 is close to TRIPS, while T3-8 outperforms TRIPS by 1.43x with 25% less energy

    Results Breakdown

    Major delay savers: IPP, block mapping and block reissue

    Major energy savers: EOBs, block mapping and block reissue

    [Plot: per-mechanism breakdown of speedup and normalized core + L1 energy over a single dual-issue core vs. number of cores]

    SPEC INT Cross-Platform Comparison

    Few cores (1 to 2) → energy efficient with high performance
    More cores (4 to 8) → increased performance at low energy cost

    [Plot: performance (P) and energy (E) vs. number of cores, normalized to a single dual-issue core]

    Efficiently covers a much larger operating spectrum than DVFS

    SPEC FP Performance/Energy Results

    [Plot: SPEC FP speedup and normalized core + L1 energy over a single dual-issue core vs. number of cores, with Pollack's rule for reference]

    SPEC FP Cross-Platform Comparison

    [Plot: SPEC FP performance (P) and energy (E) vs. number of cores, normalized to a single dual-issue core]

    Significantly improved performance and energy efficiency compared to INT

    Summary of Bottleneck Analysis
    • Fine-grain dataflow communication
    • Coarse-grain register communication
    • Fetch
    • Branches & predicates

    Related Work
    • Distributed uniprocessors
      • Dynamic
        • CoreFusion [ISCA08], Forwardflow & WiDGET [ISCA10]
      • Static
        • Instruction Level Distributed Processing [ISCA02], WaveScalar [MICRO03], Multiscalar [ISCA95], TLS [IEEE Comp. 99]
    • Efficiency optimization
      • Instruction mapping
        • RAW [IEEE Comp. 97], clustered superscalar [ISCA03]
      • Instruction reuse
        • Trace processors [MICRO96], instruction revitalization [MICRO03]
      • Register/memory bypassing
        • Memory bypassing and cloaking [IJPP99], TLS synch. scoreboard [IJPP03]
    • Critical path analysis
      • Original paper [ISCA01], TRIPS criticality analysis [ISPASS06]
    Conclusions
    • Rethink traditional execution models for better efficiency and flexibility in future systems
    • This study: achieving an efficient EDGE composable system
    • Methodology
      • Systematic bottleneck analysis
      • Balancing communication and execution by specializing communication and speculation at different levels of the hierarchy
      • Achieving the right division between EDGE hardware and software
    • Further optimizations are still possible (E2 system)
      • Better code quality
      • Instruction packing (variable block sizes and SIMD/vector instructions)
      • How can composability improve multithread efficiency?
    Acknowledgements
    • My advisor: Doug Burger
    • My committee members: Kathryn McKinley, Steve Keckler, Calvin Lin and Steve Reinhardt
    • My collaborators: Katie Coons, Bert Maher, Aaron Smith [Compiler], Jeff Diamond [TRIPS BLAS], Dong Li [EOBs], Hadi Esmaeilzadeh [IPP] and Sibi Govindan [Power]
    • Other colleagues for their significant comments and advice: Boris Grot, Mark Gebhart …
    • CART Lab and Speedway Group
    • UTCS and MSR
    Thank you

    [Figure: Terminator/EDGE analogy; TRIPS: efficiency (future technology), TFlex: flexibility (little pieces merging), T3: flexibility + efficiency]

    List of Backup Slides
    • Compiler
    • Comparison Platforms
    • Energy-delay Product
    • Power Breakdown
    • E2
    • uArch Support and Tuning for EOBs
    • TRIPS vs. TFlex
    • Iterative Path Predictor
    • Block Mapping
    • EDGE Background
    • Next Block Prediction Issue
    • Block Reissue
    • Limit Study & Research Interests
    • Final Criticality Results
    Limit Study
    • Chances for additional uArch improvements
      • Under ideal speculative execution
        • Perfect predicate, dependence, and branch prediction
      • Only when all are applied does T3 see a significant speedup
    Research Interests
    • Redesigning computer systems for efficiency, power, security, and resiliency
    • Redesign all hw/sw stack layers with respect to the workload and the factor under optimization
      • Synchronization & communication according to the workload
    • More systematic and intelligent ways to redesign systems
      • Machine learning & criticality analysis for specializing important operational modes

    [Figure: the hw/sw stack (Application, Programming Language, OS and Runtime, Architecture (ISA), uArchitecture, Circuit) crossed with new workloads (data centric, augmented reality, NUI, gaming, cloud), new technologies (NVRAMs, nano, etc.), the factor under optimization (power, performance, security, resiliency), and the optimization method]

    Systematic Bottleneck Analysis and Reduction

    System-level analysis → detect the bottleneck component → component-level analysis → detect the scenario causing the bottleneck → choose and apply the right optimization mechanism → repeat

    Block Reissue
    • Reducing misspeculation penalty [HPCA11]
    • Reissuing critical instructions following pipeline flushes
    • Saves energy & delay by reducing fetches by about 50%

    [Plot: percent of reissued blocks]

    Introduction to E2
      • 128-entry instruction window
      • Divided into 4 lanes
      • Each lane can execute an independent (32-instruction) hyperblock
      • Vector instructions only target instructions in the same lane
    • 64 general-purpose registers
    • 32 KB L1 instruction and data caches

    [Figure: an E2 core; control logic plus four lanes, each with a 32 x 54b instruction window, a 16 x 64b register bank (registers 0-15, 16-31, 32-47, 48-63), two 32 x 64b operand buffers, and an ALU; a 32 KB L1 instruction cache, a 32 KB L1 data cache, an unordered load/store queue, a branch predictor, and a memory interface controller]

    MSR E2 Dynamic Multicore System
    • Variable-size blocks and SIMD/vector operations, in addition to the T3 optimizations
    • 4 lanes, each with an ALU and one bank of the instruction window, register file, and operand buffer
    • Supports up to 4 blocks per core and fine-grained SIMD/vector operations
    Generating Large Blocks
    • The compiler can generate larger blocks by converting control dependences to data dependences through predication (see the example after the figure)
    • In the figure, each color represents a basic block
    • RD, WR: inter-block register communication
    • Dataflow inter-block communication

    [Figure: a flow graph of four basic blocks merged into one hyperblock (HB1-HB4); RD/WR instructions carry inter-block register communication, tstlti and tstz produce true/false predicates, and BR1/BR2 are block-exit branches]
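    A small, hypothetical example of the predication step, written in Python only to show the shape of the transform (the actual compiler works on TRIPS intermediate code):

        # Before: a control dependence; a branch picks one arm.
        def before(a, b, p):
            if p:
                x = a + b
            else:
                x = a - b
            return x

        # After if-conversion: both arms become predicated instructions inside
        # one hyperblock; the test result is now a data input gating each one.
        def after(a, b, p):
            t = a + b             # executes predicated on p being true
            f = a - b             # executes predicated on p being false
            return t if p else f  # the predicate selects which value reaches WR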

    Scale Compiler for TRIPS

    Compiler phases: hyperblock formation (if-conversion, loop peeling, while-loop unrolling, predicate optimizations) → register allocation (splitting for spill code) → scheduling (placement) → dataflow code generation

    [Figure: the hyperblock dataflow graph from the previous slide]

    An optimizing compiler for TRIPS that compiles all SPEC benchmarks

    Challenges to be Addressed
    • Executing a stream of blocks
      • Oldest to youngest (commit order)
    • Fundamental challenges
      • How should instructions be mapped to hardware?
      • How should instructions communicate?
      • How to support inter- and intra-block speculation?
    • What is the right division between the compiler and hardware?

    [Figure: a predicted block path mapped onto hardware between register-file reads and writes]

    Hierarchical Communication Model

    [Figure: hierarchical communication model; speculation (front end) divides into non-critical and critical, while communication (back end) divides into inter-block vs. intra-block and, within a block, low-fanout vs. high-fanout]

    Compiling for EDGE

    [Figure: the Scale compiler phases applied to a flow graph, producing hyperblocks HB1-HB4 with RD/WR block register inputs/outputs (data), predicates (tstlti, tstz), dataflow links (data), and block-exit branches BR1-BR3 (control)]

    Need for Efficiency and Flexibility
    • End of Dennard scaling
      • Over the next 5 technology generations (to 2024), only 7.9x speedup is possible (CPU or GPU)
      • At 8 nm, 50% of the chip will not be utilized [ISCA11]
      • Need more efficient cores using radical architectural innovations
        • Save delay and power together
        • Maximize efficiency for each thread and across threads
    • Supporting future workloads without heterogeneous ISA overheads

    [Figure: die photos (Nehalem, Llano, Tegra)]

    Support for Exposed Operand Broadcast

    [Figure: microarchitectural support for EOBs; each instruction-queue entry carries its opcode, predicate, operands, two targets, and a 3-bit send broadcast ID (SBCID); a broadcast of (BCID, type, value) is matched against waiting instructions' 3-bit receive BCIDs (RBCID) through a small BC CAM]

    Tuning EOBs
    • Design parameter: number of EOBs
      • More available EOBs → more moves removed, but wider EOB CAMs used by the bypass network
      • Optimum point: 8 EOBs (3 bits wide) for minimum overhead
    Static Scheduling Overview

    Compiler phases: hyperblock formation (if-conversion, loop peeling, while-loop unrolling, predicate optimizations) → register allocation (splitting for spill code) → scheduling (placement)

    [Figure: static placement, dynamic issue; a dataflow graph (ld, mul, add, br) is placed onto a topology of register (R), data cache (D), execution, and control tiles; 128! scheduling possibilities]

    INT Performance/Energy Results

    [Plot: INT speedup and energy over a single dual-issue core vs. number of cores for T3, with Pollack's rule for reference]

    Results Breakdown

    [Plot: per-mechanism breakdown of speedup and energy over a single dual-issue core vs. number of cores]

    FP Performance/Energy Results

    [Plot: FP speedup and energy over a single dual-issue core vs. number of cores, with Pollack's rule for reference]

    Need for Efficiency and Flexibility
    • End of multicore scaling!
      • Moore's law ends after at most 5 more generations (8 nm)
      • By then (8 nm in 2024), only 4x to 7.9x speedup is possible using multicores (CPU or GPU)
      • 10% to 50% of the chip will not be utilized [Esmaeilzadeh ISCA11]
      • Need more efficient cores using radical architectural innovations
        • Save delay and power together
        • Maximize efficiency for each thread and across threads

    H. Esmaeilzadeh [ISCA11]

    EDGE uArchitectures Timeline

    [Timeline, 2000-2015]
    • TRIPS (UT): distributed EDGE uArch (initial implementation)
    • TFlex (UT): fully distributed registers, caches, and control (evaluation)
    • T3 (UT): reinvented uArch and ISA for power & performance efficiency (ISA/uArch reinvention; the scope of this talk)
    • E2 (MSR): support for variable block sizes and SIMD/vector mode

    Static vs Dynamic Dependences

    • Up to 128 instructions in each block
    • Intra-block dependences are statically detected; inter-block dependences are dynamically detected

    [Figure: hyperblocks HB1-HB4 with RD/WR register communication and predicates; legend: dataflow link (data), predicated on true (control), predicated on false (control), block register outputs (data), block register inputs (data), block exit branches (control)]


    Microarchitectural Topology

    [Figure: TRIPS: four register banks (R0-R3), four data-cache banks (D0-D3), and a 4x4 grid of execution tiles (E0-E15), with 1-wide issue per E tile. TFlex, T3, and E2: an array of cores, each with a register bank (R), data-cache bank (D), branch predictor (BP), and ALU, with 2-wide issue per core]

    Basic OGEHL Predictor
    • Has a relatively long delay
    • N predicate bits → delay of 3N cycles

    [Figure: basic OGEHL predictor; the 200-bit GHR and 40-bit block PC are hashed (H1-H4) into 8-bit indexes for tables T0-T4 of 4-bit counters over history lengths L(0)-L(4), producing one 1-bit prediction at a time with speculative update of the predicted path]

    Pipelined OGEHL Predictor
    • Improved delay: N + 2 cycles
    • Possible hazards

    [Figure: pipelined OGEHL predictor with stages for initial index compute, table access, and prediction sum; hazards can arise between back-to-back predictions]

    Pipelined with Bypassing (2w Tables)

    [Figure: pipelined OGEHL with bypassing and 2-wide (2w) tables using 7-bit indexes]

    Aggressive Pipeline (7w Tables)
    • Low delay: N/3 + 2 cycles
    • More complex logic
    • Possible aliasing

    [Figure: aggressive pipeline with 7-wide (7w) tables of 4-bit counters and 5-bit indexes, producing 3 prediction bits per access]

    Iterative Path Predictor Results
    • Improves branch prediction accuracy
      • MPKI goes from 4.3 to 3.6
      • The fraction of flushed blocks goes from 12% to 8%
      • 5% energy savings per core
    • Speeds up execution by speculating on predicate paths
      • 12% speedup when using 16 cores
      • 98% path-prediction accuracy
    Next Block Prediction
    • Predicting next block
      • Multiple branches per block
      • The taken branch depends on executed predicate path
      • Target correlation
      • Exit IDs vs. predicate path
    • Predict taken predicate path and use it to predict next block
    • Speculatively execute predicate path

    [Figure: hyperblock dataflow graph with the predicted block path and the taken predicate path highlighted]

    Using EDGE for Efficiency and Composition
    • EDGE ISAs and uArchitectures promise composition
      • Efficiency: dataflow and block atomicity
      • Flexibility: distributed microarchitectures
    • Look back at early EDGE designs and revisit the basics using a systematic methodology
    • Propose a new design to fulfill these goals
      • Systematic bottleneck analysis and removal (not covered)
      • Design space exploration for power efficiency
        • Balancing computation and communication hierarchically: distribution, communication, speculation, operand delivery
        • A balanced division between compiler and hardware
      • A complete power, performance, and scalability evaluation
    Platform Comparison Parameters
    • Atom area: 8.58 mm²
    • Core 2 area: 22.4 mm²
    • T3 single-core area: 2.5 mm²

    Mapping Computation on Thin Cores

    [Figure: flat mapping; 1, 2, or 4 speculative blocks are spread across all cores, each core holding an instruction queue (IQ), data-cache bank (D), and register bank (R)]

    • Used by early EDGE designs to map speculative dataflow blocks

    Static Placement (Flat Mapping)
    • Aims to exploit intra-block parallelism using 1-wide cores
    • Applied to one block at a time
      • 128! possible schedules
      • More complicated considering register locations and compiler phases (block generation and register allocation)
    • Good heuristics are based on the estimated critical path within the block (a toy sketch follows)
      • Machine learning can help [LCPC08, PACT08]
    • Is this the right solution?
      • Hard to achieve a global solution!
      • Observation: intra-block communication is a bottleneck
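    A toy sketch of such a heuristic (illustrative only; the schedulers of [LCPC08, PACT08] model the topology and anticipated criticality in far more detail): place each instruction, in dependence order, on the core where its operands arrive earliest.

        # Critical-path-driven placement with a 1-cycle cost per hop of core distance.
        def place(block, num_cores, hop_cost=1):
            core_of, ready = {}, {}
            def arrival(deps, c):
                return max((ready[d] + hop_cost * abs(core_of[d] - c) for d in deps),
                           default=0)
            for inst, deps, latency in block:        # topological order
                best = min(range(num_cores), key=lambda c: arrival(deps, c))
                core_of[inst] = best
                ready[inst] = arrival(deps, best) + latency
            return core_of

        block = [("ld", [], 3), ("muli", ["ld"], 2), ("add", ["ld", "muli"], 1)]
        print(place(block, num_cores=4))   # all on core 0: dataflow stays local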
    Reducing Communication Overheads

    [Figure: deep mapping; each of 1, 2, or 4 blocks occupies its own core's instruction queue (IQ), with data-cache (D) and register (R) banks on each core]

    • Maybe a better choice for slightly stronger cores


    Dynamic Placement (Deep Mapping)
    • Aims to maximize inter-block parallelism [MICRO08]
    • Intra-block parallelism is restricted by the issue width of each core
    • Saves cross-core communication by restricting it to register and memory communication among blocks
      • Cross-core communication causes significant power overheads compared to computation [Bill Dally SC10]
    • Simpler iCache structures
    • The hardware dynamically selects the core on which each block is mapped
    Different Mapping Strategies
    • Distributing all execution across cores
      • Intra-block dataflow communication across cores
      • Inter-block register communication across cores
    • Mapping one block per core
      • Intra-block dataflow communication within core
      • Inter-block register communication across cores

    [Figure: flat mapping vs. deep mapping]

    Traditional Out-of-order Execution
    • Hardware generates a dynamic dataflow graph of the fetched instructions
      • Executes instructions out of order
      • Commits in order and updates the architectural state of the program
    • Complicated logic for generating and maintaining the graph dynamically
    • Does that scale?

    [Figure: out-of-order pipeline: in-order fetch, register renaming, scheduling logic, reorder buffer, registers, memory, in-order commit]

    Scaling Challenges of Fat Cores
    • The components in charge of constructing and maintaining the instruction window
      • Complexity grows quadratically
        • # of ports
        • Logic
    • With the slowdown in power scaling, these structures do not scale any more!

    [Figure: the same out-of-order pipeline, highlighting the renaming, scheduling, and reorder structures that stop scaling]

    Compiler Can Help
    • Most of the dependence graph is known at compile time
      • Some long-term memory and register dependences are not known statically
    • The compiler can generate these graphs and hand them to hardware
      • Significantly reduces dependence-detection, fetch, and prediction overheads

    Block Prediction & Fetch

    Reg File

    Memory

    In-order Block Commit

    Distributing Execution across Cores
    • Mapping blocks of dataflow instructions on a grid of many wimpy cores [MICRO08]
      • Maximize performance
      • Small communication overhead
    • Different tradeoffs
      • Different types of parallelism and communication
        • Among instructions in each block
        • Among parallel blocks
      • Characteristics of light-weight cores
      • Design space exploration
    Different Mapping Strategies
    • Flat mapping (Traditional)
      • Exploits intra-block parallelism
      • Compiler can help: scheduling and register allocation [LCPC08, PACT08]
      • Hard to achieve a global solution!
      • Intra-block dataflow communication is a bottleneck
    • Deep mapping
      • Saves cross-core communication by limiting dataflow to single cores
      • Limited intra-block parallelism
      • Simpler instruction cache structures
      • Dynamically select the cores for mapping blocks

    [Figure: flat mapping vs. deep mapping]

    Space Exploration for Distributing Computation

    [Plot: percent of total network hops (inter-block vs. intra-block) under flat mapping, and SPEC speedups over a single dual-issue core]

    • Flat mapping works for 1-wide issue; deep mapping is better for 2-wide issue and saves energy and delay
    • Register communication is now the bottleneck

    Next Block Prediction
    • Coarse-grain branch prediction
      • Trace processors
      • Multiple predictions per access
    • Similar problem with dataflow blocks
      • Predicting the next block
      • Multiple predicate paths in each block

    [Figure: predicted block path]

    Next Block Prediction
    • Branches in blocks converted to predicates
    • Predicting next block
      • Multiple branches per block
      • The taken branch depends on executed predicate path
    • Solution
      • Predict taken predicate path and use it to predict next block
      • Speculatively execute dataflow predicate path!

    [Figure: hyperblock dataflow graph with the predicted block path and taken predicate path highlighted]

    Block Reissue
    • Instruction reuse: trace caches and loop buffers in out-of-order processors
    • There is no control flow within a block, so blocks can be reissued if they are still in the window
    • Reissue critical instructions after pipeline flushes to reduce the misspeculation penalty [HPCA11]
    • Saves energy and delay by reducing fetches and decodes by about 50%
    Reducing Multi-core Register Communication
    • Problem: cross-core communication via distributed registers
      • Distributed register forwarding [TRIPS] or network broadcasts [TLS]
    • Solution: late (critical) values go directly from producer to consumer core via very low-overhead bypassing, while the rest use register forwarding [HPCA11]

    [Figure: block B0 forwards register values to B1 either via the register home core (normal forwarding) or directly via register bypassing for critical register values]

    SPEC FP Performance/Energy Results

    [Plot: SPEC FP speedup and normalized core + L1 energy over a single dual-issue core vs. number of cores, with Pollack's rule for reference]

    45x energy-delay^2 improvement with 16 cores!

    Bottlenecks in TFlex Composable EDGE Design
    • Intra-block operand communication due to fine-grain instruction distribution among cores
    • Inter-block register communication among cores
    • Expensive refills after pipeline flushes
    • Poor next-block prediction accuracy and low speculation rate due to predicates
    • Compiler-generated fanout trees built for high-fanout operand delivery
    Previously Presented in Thesis Proposal
    • Deep block mapping [MICRO08, PACT08, LCPC08]
      • More coarse-grained parallelism and less cross-core operand traffic by mapping each block into one core
    • Register bypassing [HPCA11]
      • Reducing cross-core register communication delay by bypassing register values predicted to be critical directly from producing to consuming cores
    • Block reissue [HPCA11]
      • Reducing pipeline flush penalties by allowing instructions in previously executed blocks to be reissued while they are still in the instruction queue
    SPEC INT Performance/Energy Results

    [Plot: SPEC INT speedup and normalized core + L1 energy over a single dual-issue core vs. number of cores for T3, with Pollack's rule for reference]

    TFlex-8 is close to TRIPS, while T3-8 outperforms TRIPS by 1.43x with 25% less energy

    Results Breakdown

    Major delay savers: IPP, block mapping and block reissue

    Major energy savers: EOBs, block mapping and block reissue

    [Plot: per-mechanism breakdown of speedup and normalized core + L1 energy vs. number of cores]

    SPEC INT Cross-Platform Comparison

    Few cores (1 to 2) → energy efficient with high performance
    More cores (4 to 8) → increased performance at low energy cost

    [Plot: performance (P) and energy (E) vs. number of cores, normalized to a single dual-issue core]

    Efficiently covers a much larger operating spectrum than DVFS

    SPEC FP Performance/Energy Results

    [Plot: SPEC FP speedup and normalized core + L1 energy vs. number of cores, with Pollack's rule for reference]

    SPEC FP Cross-Platform Comparison

    [Plot: SPEC FP performance (P) and energy (E) vs. number of cores, normalized to a single dual-issue core]

    Significantly improved performance and energy efficiency compared to INT