
Processor Architecture

Microarchitectural Wire Management for Performance and Power in partitioned architectures

Rajeev Balasubramonian

Naveen Muralimanohar

Karthik Ramani

Venkatand Venkatachalapathy

University of Utah



Overview/Motivation

  • Wire delays hamper performance.

  • Power incurred in movement of data

    • 50% of dynamic power is in interconnect switching (Magen et al. SLIP 04)

    • MIT Raw processor’s on-chip network consumes 36% of total chip power (Wang et al. 2003)

  • Abundant metal layers are available


    Wire characteristics

    • Wire resistance (R) and capacitance (C) per unit length determine delay

    • Width ↑ ⇒ R ↓, C ↑

    • Spacing ↑ ⇒ C ↓

    • Net effect: delay ↓ (since delay ∝ RC) and bandwidth ↓ (see the sketch below)
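    A minimal sketch of these trends, using made-up constants rather than values from the paper: delay tracks the distributed RC product, so widening a wire lowers R and extra spacing lowers coupling C, but both spend routing tracks and therefore bandwidth.

```python
# First-order sketch of the width/spacing trade-off.
# All constants are arbitrary illustration values; only the trends matter.

def wire_rc_delay(width, spacing, length_mm=10.0):
    """Return (delay, pitch) for one wire of the given relative geometry."""
    rho = 0.1                                   # sheet-resistance factor
    r_per_mm = rho / width                      # wider wire -> lower R
    c_per_mm = 1.0 * width + 2.0 / spacing      # ground C grows with width,
                                                # coupling C shrinks with spacing
    delay = 0.5 * r_per_mm * c_per_mm * length_mm ** 2   # distributed-RC estimate
    pitch = width + spacing                     # routing tracks consumed per wire
    return delay, pitch

# B-like wires vs. wider / more widely spaced (W/L-like) wires:
for w, s in [(1.0, 1.0), (2.0, 2.0), (4.0, 4.0)]:
    d, p = wire_rc_delay(w, s)
    print(f"width={w} spacing={s} -> delay={d:5.1f}, pitch={p:.1f} (fewer wires per area)")
```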


    Design space exploration

    • Tuning wire width and spacing

    [Figure: wire cross-sections at the default width/spacing d (B wires) and at double the width/spacing (2d); the wider, more widely spaced wires have lower resistance and capacitance, but fewer of them fit in the same area, so bandwidth drops.]



    Transmission Lines

    • Similar to L wires - extremely low delay

    • Constraining implementation requirements!

      • Large width

      • Large spacing between wires

      • Design of sensing circuits

    • Implemented in test CMOS chips


    Design space exploration

    • Tuning repeater size and spacing

    [Figure: delay vs. power for repeated wires. Traditional wires use large repeaters at the delay-optimal spacing; power-optimal wires use smaller repeaters with increased spacing, trading some delay for lower power. A sketch of this trade-off follows.]
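    A hedged, first-order sketch of the repeater trade-off (an illustrative model, not the circuit model used in the paper): shrinking the repeaters and spreading them further apart cuts repeater power sharply while adding only some delay.

```python
# Illustrative (uncalibrated) model of repeater sizing/spacing for a repeated wire.
# size and spacing are relative to the delay-optimal design point (1.0, 1.0).

def repeated_wire(size, spacing):
    # Per-segment delay: the driver gets weaker as size shrinks, and the wire RC
    # grows quadratically with segment length (i.e. with spacing).
    seg_delay = (1.0 / size) * (size + spacing) + 0.5 * spacing ** 2
    delay_per_mm = seg_delay / spacing      # fewer segments when spaced further apart
    power_per_mm = size / spacing           # smaller, sparser repeaters switch less charge
    return delay_per_mm, power_per_mm

print("size spacing  delay/mm  power/mm")
for size, spacing in [(1.0, 1.0), (0.5, 1.5), (0.3, 2.0)]:   # B-like vs. PW-like wires
    d, p = repeated_wire(size, spacing)
    print(f"{size:4.1f} {spacing:7.1f} {d:9.2f} {p:9.2f}")
```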


    Design space exploration

    • Wire implementations spanning the design space:

      • B wires: delay optimized

      • W wires: bandwidth optimized

      • P wires: power optimized

      • PW wires: power and bandwidth optimized

      • L wires: fast, low bandwidth



    Heterogeneous Interconnects

    • Inter-cluster global interconnect (see the configuration sketch after this list)

      • 72 B wires

        • Repeaters sized and spaced for optimum delay

      • 18 L wires

        • Wide wires and large spacing

        • Occupies more area

        • Low latencies

      • 144 PW wires

        • Poor delay

        • High bandwidth

        • Low power
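    The link composition above can be written down as a small configuration sketch; the latencies mirror the four-cluster evaluation later in the talk, and the relative energy numbers are placeholders, not measured values.

```python
from dataclasses import dataclass

@dataclass
class WireClass:
    name: str
    count: int           # physical wires in the inter-cluster link
    latency_cycles: int  # one-way latency (from the four-cluster evaluation)
    rel_energy: float    # energy per bit relative to a B wire (placeholder)

# One heterogeneous inter-cluster link, as composed on this slide.
HETERO_LINK = [
    WireClass("B",  count=72,  latency_cycles=2, rel_energy=1.0),  # delay-optimal baseline
    WireClass("L",  count=18,  latency_cycles=1, rel_energy=1.2),  # wide/spaced: fast, narrow
    WireClass("PW", count=144, latency_cycles=3, rel_energy=0.4),  # power-optimal, high bandwidth
]

print(f"{sum(w.count for w in HETERO_LINK)} wires total:",
      ", ".join(f"{w.count}x{w.name} ({w.latency_cycles} cyc)" for w in HETERO_LINK))
```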



    Outline

    • Overview

    • Design Space Exploration

    • Heterogeneous Interconnects

    • Employing L wires for performance

    • PW wires: The power optimizers

    • Evaluation

    • Results

    • Conclusion


    L1 Cache pipeline

    [Figure: baseline load pipeline. The effective address is transferred to the LSQ/L1 D-cache over B wires in 10 cycles, memory dependence resolution in the LSQ takes 5 cycles, the cache access takes 5 cycles, and the data returns at cycle 20.]

    Exploiting L-Wires

    [Figure: load pipeline with L wires. An 8-bit slice of the effective address reaches the LSQ/L1 D-cache in 5 cycles on L wires while the full address transfer still takes 10 cycles; partial memory dependence resolution takes 3 cycles, the 5-cycle cache access starts early, and the data returns at cycle 14 instead of 20.]



    L wires: Accelerating cache access

    • Transmit the least significant bits (LSBs) of the effective address on L wires

      • Partial comparison of loads and stores in the LSQ (see the sketch after this list)

      • Faster memory disambiguation

      • Introduces false dependences (< 9%)

    • Indexing data and tag RAM arrays

      • The LSBs can be used to prefetch data out of the L1$

      • Reduce access latency of loads
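    A minimal sketch of the partial comparison (hypothetical helper, not the paper's exact LSQ circuitry): with only the low-order address bits available early, a load is conservatively treated as dependent on any earlier store whose bits match, which is what occasionally introduces a false dependence.

```python
# Partial memory disambiguation using only the low-order address bits that
# arrive early on L wires. Hypothetical sketch, not the actual LSQ design.

LSB_BITS = 8
LSB_MASK = (1 << LSB_BITS) - 1

def may_conflict(store_addr: int, load_addr: int) -> bool:
    """Conservative early check: assume a dependence whenever the LSBs match."""
    return (store_addr & LSB_MASK) == (load_addr & LSB_MASK)

older_store = 0x12345678
independent = 0xABCD5679   # different LSBs: the load may issue early
aliased_lsb = 0xABCD5678   # different address, same LSBs: false dependence

print(may_conflict(older_store, independent))   # False -> proceed before full address
print(may_conflict(older_store, aliased_lsb))   # True  -> wait for full comparison
```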



    L wires: Narrow bit width operands

    • Transfer of 10-bit integers on L wires

      • Schedule wake up operations

      • Reduction in branch mispredict penalty

      • A predictor table of 8K two-bit counters (see the sketch after this list)

        • Identifies 95% of all narrow bit-width results

        • Accuracy of 98%

    • Implemented in the PowerPC!
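    A hedged sketch of one way the predictor could be organized, assuming PC-indexed two-bit saturating counters; the indexing and update policy are assumptions for illustration, not the paper's exact design.

```python
# Narrow bit-width prediction with 8K two-bit saturating counters.
# PC indexing and the training policy below are illustrative assumptions.

TABLE_SIZE = 8192
counters = [0] * TABLE_SIZE          # two-bit counters, values 0..3

def predict_narrow(pc: int) -> bool:
    """Predict that the result fits in 10 bits when the counter is saturated high."""
    return counters[pc % TABLE_SIZE] >= 2

def train(pc: int, result: int) -> None:
    """Update the counter with whether the result actually fit in 10 signed bits."""
    idx = pc % TABLE_SIZE
    if -(1 << 9) <= result < (1 << 9):
        counters[idx] = min(3, counters[idx] + 1)
    else:
        counters[idx] = max(0, counters[idx] - 1)

for r in (5, -7, 12):                # a few narrow results saturate the entry
    train(0x400100, r)
print(predict_narrow(0x400100))      # True -> route the operand on L wires
```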



    PW wires: Power/Bandwidth efficient

    • Idea: steer non-critical data through the energy-efficient PW interconnect (a steering sketch follows this list)

    • Transfer of data at instruction dispatch

      • Transfer of input operands to remote register file

      • Hidden by the long dispatch-to-issue latency

    • Store data
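    A small sketch of the steering idea (the classification of transfer types is an assumption for illustration): traffic whose latency is already hidden rides the PW wires, and everything else stays on the delay-optimal B wires.

```python
# Criticality-based steering of inter-cluster transfers (illustrative policy).

NON_CRITICAL = {
    "dispatch_operand",   # operand copy to a remote register file, hidden
                          # under the long dispatch-to-issue latency
    "store_data",         # store value; the address is the latency-critical part
}

def choose_wires(transfer_kind: str) -> str:
    """Route non-critical traffic on power-efficient PW wires."""
    return "PW" if transfer_kind in NON_CRITICAL else "B"

for kind in ("dispatch_operand", "store_data", "load_result", "bypass_operand"):
    print(f"{kind:16s} -> {choose_wires(kind)} wires")
```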



    Evaluation methodology

    • A dynamically scheduled clustered processor with 4 clusters, modeled in SimpleScalar-3.0

    • Crossbar interconnects

    • Centralized front-end

      • I-Cache & D-Cache

      • LSQ

      • Branch Predictor

    [Figure: four clusters connected to the centralized L1 D-cache through the heterogeneous interconnect; L wires take 1 cycle, B wires 2 cycles, and PW wires 3 cycles. A configuration sketch follows.]
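    The simulated machine can be summarized in a small configuration sketch; the keys below are descriptive labels, not SimpleScalar-3.0's actual option names, and the sixteen-cluster ring configuration on the next slide can be expressed the same way.

```python
# Summary of the four-cluster partitioned processor used in the evaluation.
# Keys are descriptive labels, not SimpleScalar-3.0 option names.
FOUR_CLUSTER_CONFIG = {
    "simulator": "SimpleScalar-3.0 (dynamically scheduled, clustered)",
    "clusters": 4,
    "interconnect": "crossbar",
    "centralized_front_end": ["I-cache", "D-cache", "LSQ", "branch predictor"],
    "wire_latencies_cycles": {"L": 1, "B": 2, "PW": 3},
}

for key, value in FOUR_CLUSTER_CONFIG.items():
    print(f"{key:22s}: {value}")
```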



    Evaluation methodology

    • A dynamically scheduled 16-cluster processor modeled in SimpleScalar-3.0

    • Ring latencies

      • B wires (4 cycles)

      • PW wires (6 cycles)

      • L wires (2 cycles)

    [Figure: sixteen clusters connected by a ring interconnect; a crossbar links the ring to the centralized I-cache, D-cache, and LSQ.]



    IPC improvements: L wires

    L wires improve performance by 4% on a four-cluster system and by 7.1% on a sixteen-cluster system



    Four cluster system: ED2 gains



    Sixteen cluster system: ED2 gains



    Conclusions

    • Exposing the wire design space to the architecture

    • A case for micro-architectural wire management!

    • A low-latency, low-bandwidth network alone improves performance by up to 7%

    • ED2 improvements of about 11% over a baseline processor with a homogeneous interconnect (ED2 is defined below)

    • Entails hardware complexity
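    For reference, the ED2 metric combines energy and execution time; the standard energy-delay-squared formulation assumed here is:

```latex
% ED^2 metric (lower is better). An ~11% improvement means the heterogeneous
% design satisfies E' D'^2 \approx 0.89 \, E D^2 relative to the homogeneous baseline.
\[
  \mathrm{ED}^2 = E \cdot D^2,
  \qquad
  \text{improvement} = 1 - \frac{E'\,D'^2}{E\,D^2}
\]
```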



    Future work

    • A preliminary evaluation looks promising

    • Heterogeneous interconnect entails complexity

    • Design of heterogeneous clusters

    • Energy efficient interconnect



    Questions and Comments?

    Thank you!



    Backup



    L wires: Accelerating cache access

    • TLB access for page lookup

      • Transmit a few bits of the virtual page number on L wires

      • Prefetch data out of the L1$ and TLB

      • 18 L wires (6 tag bits, 8 L1-index bits, and 4 TLB-index bits); see the bit-layout sketch below
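    A sketch of how the early address bits could be packed onto those 18 L wires; the field positions assume 32-byte cache lines and 4 KB pages, which are assumptions for illustration rather than the paper's exact layout.

```python
# Splitting a virtual address into the bits sent early on 18 L wires:
# 8 L1 index bits, 4 TLB index bits, and 6 partial tag bits.
# Offsets assume 32-byte lines and 4 KB pages (illustrative assumptions).

LINE_OFFSET_BITS = 5
PAGE_OFFSET_BITS = 12
L1_INDEX_BITS, TLB_INDEX_BITS, PARTIAL_TAG_BITS = 8, 4, 6   # 8 + 4 + 6 = 18 wires

def early_bits(vaddr: int) -> dict:
    return {
        "l1_index":    (vaddr >> LINE_OFFSET_BITS) & ((1 << L1_INDEX_BITS) - 1),
        "tlb_index":   (vaddr >> PAGE_OFFSET_BITS) & ((1 << TLB_INDEX_BITS) - 1),
        "partial_tag": (vaddr >> (PAGE_OFFSET_BITS + TLB_INDEX_BITS))
                       & ((1 << PARTIAL_TAG_BITS) - 1),
    }

print(early_bits(0x00402A64))   # e.g. {'l1_index': 83, 'tlb_index': 2, 'partial_tag': 0}
```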



    Model parameters

    • SimpleScalar-3.0 with separate integer and floating-point queues

    • 32 KB 2-way instruction cache

    • 32 KB 4-way data cache

    • 128-entry 8-way I- and D-TLBs



    Overview/Motivation:

    • Three wire implementations employed in this study

      • B wires: traditional

        • Optimal delay

        • Huge power consumption

      • L wires:

        • Faster than B wires

        • Lower bandwidth

      • PW wires:

        • Reduced power consumption

        • Higher bandwidth compared to B wires

        • Increased delay through the wires

