Microarchitectural Wire Management for Performance and Power in Partitioned Architectures


Processor Architecture

Microarchitectural Wire Management for Performance and Power in Partitioned Architectures

Rajeev Balasubramonian

Naveen Muralimanohar

Karthik Ramani

Venkatanand Venkatachalapathy

University of Utah


Overview/Motivation

  • Wire delays hamper performance.

  • Significant power is incurred in moving data across the chip

    • 50% of dynamic power is spent in interconnect switching (Magen et al., SLIP '04)

    • The MIT Raw processor's on-chip network consumes 36% of total chip power (Wang et al., 2003)

  • An abundant number of metal layers is available


  • Wire characteristics

    • Wire resistance and capacitance per unit length determine delay

    • Width ↑ ⇒ R ↓, C ↑

    • Spacing ↑ ⇒ C ↓

    • Net effect of wider, better-spaced wires: Delay ↓ (since delay ∝ RC), Bandwidth ↓ (a first-order delay relation is given below)
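
    For reference (a standard first-order model, not taken from the slide), the delay of an unrepeated wire of length $l$ with per-unit-length resistance $r$ and capacitance $c$ is approximately

    $$ t_{\mathrm{wire}} \approx 0.38\, r\, c\, l^{2} $$

    so lowering the $rc$ product by widening and spacing the wires lowers delay directly; repeater insertion (a later slide) makes the growth linear rather than quadratic in $l$.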


    Design space exploration

    • Tuning wire width and spacing

    [Figure: wire cross-sections comparing baseline B wires of width/spacing d with wires of width/spacing 2d, annotated with their relative resistance, capacitance, and bandwidth. The wider, better-spaced geometry lowers resistance and capacitance (and hence delay) but halves the number of wires, and thus the bandwidth, of a fixed routing area.]


    Transmission Lines

    • Similar to L wires: extremely low delay

    • But constraining implementation requirements:

      • Large width

      • Large spacing between wires

      • Design of sensing circuits

    • Implemented in test CMOS chips


    Design space exploration

    • Tuning repeater size and spacing

    [Figure: delay vs. power for repeated wires. Traditional wires use large repeaters at delay-optimal spacing, minimizing delay at high power; power-optimal wires use smaller repeaters with increased spacing between them, accepting a modest delay penalty for much lower power. Standard first-order sizing expressions are given below.]
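
    For reference, the classic first-order repeater-insertion results (standard textbook expressions, not from the slide) give the delay-optimal spacing $\ell_{\mathrm{opt}}$ between repeaters and the delay-optimal repeater size $s_{\mathrm{opt}}$ in terms of the per-unit-length wire resistance $r$ and capacitance $c$ and the resistance $R_0$ and capacitance $C_0$ of a minimum-sized repeater:

    $$ \ell_{\mathrm{opt}} = \sqrt{\frac{2\, R_0 C_0}{r c}}, \qquad s_{\mathrm{opt}} = \sqrt{\frac{R_0\, c}{r\, C_0}} $$

    Power-optimal wires back off from both optima: smaller repeaters placed farther apart cut repeater power sharply for a modest increase in delay.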


    Design space exploration

    • Delay-optimized B wires

    • Bandwidth-optimized W wires

    • Power-optimized P wires

    • Power- and bandwidth-optimized PW wires

    • Fast, low-bandwidth L wires


    Heterogeneous Interconnects

    • Inter-cluster global interconnect (a wire-selection sketch follows this list)

      • 72 B wires

        • Repeaters sized and spaced for optimum delay

      • 18 L wires

        • Wide wires and large spacing

        • Occupy more area

        • Low latencies

      • 144 PW wires

        • Poor delay

        • High bandwidth

        • Low power
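
    A minimal sketch of how a sender might steer an inter-cluster transfer across this heterogeneous link, assuming the wire counts above (72 B, 18 L, 144 PW); the function name, the criticality hint, and the selection policy are illustrative, not the paper's exact mechanism.

    # Illustrative steering of one inter-cluster transfer across heterogeneous wires.
    # Wire counts follow the slide; the "critical" hint is an assumed input.
    B_WIDTH, L_WIDTH, PW_WIDTH = 72, 18, 144   # usable bits per transfer

    def choose_wires(payload_bits: int, critical: bool) -> str:
        """Pick a wire class for a single transfer."""
        if critical and payload_bits <= L_WIDTH:
            return "L"    # narrow, latency-critical data rides the fast L wires
        if not critical:
            return "PW"   # non-critical data takes the power-efficient PW wires
        return "B"        # remaining traffic uses the delay-optimal B wires

    # Examples: an 8-bit partial address on the load path -> "L";
    # a 64-bit store-data transfer (non-critical) -> "PW".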


    Outline

    • Overview

    • Design Space Exploration

    • Heterogeneous Interconnects

    • Employing L wires for performance

    • PW wires: The power optimizers

    • Evaluation

    • Results

    • Conclusion


    L1 Cache pipeline

    [Figure: baseline load path. The effective address is transferred from the cluster to the LSQ over B wires in 10 cycles, memory dependence resolution in the LSQ takes 5 cycles, and the L1 D-cache access takes 5 cycles; load data returns at cycle 20.]


    Exploiting L-Wires

    [Figure: load path with L wires. The least-significant 8 bits of the effective address are sent ahead on L wires in 5 cycles, partial memory dependence resolution in the LSQ takes 3 cycles, and the L1 D-cache access takes 5 cycles, overlapped with the full 10-cycle effective-address transfer on B wires; load data returns at cycle 14.]
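
    Reading the cycle counts off the two pipelines (the decomposition is the one drawn on the slides):

    $$ t_{\mathrm{base}} = \underbrace{10}_{\text{addr. transfer}} + \underbrace{5}_{\text{dep. resolution}} + \underbrace{5}_{\text{cache access}} = 20 \text{ cycles}, \qquad t_{\mathrm{L\,wires}} = 14 \text{ cycles} $$

    a saving of 6 cycles, or 30%, in the load data-return latency.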


    L wires: Accelerating cache access

    • Transmit the least-significant bits of the effective address on L wires

      • Partial comparison of loads and stores in the LSQ

      • Faster memory disambiguation (a partial-comparison sketch follows this list)

      • Introduces false dependences (< 9%)

    • Indexing the data and tag RAM arrays

      • The LSBs can prefetch data out of the L1 cache

      • Reduces the access latency of loads
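
    A minimal sketch of LSB-based partial disambiguation in the LSQ, assuming an 8-bit early transfer as in the pipeline diagram; the names and interface are illustrative, not the paper's.

    # Conservative aliasing check using only the low-order address bits that
    # arrive early on the L wires. Matching LSBs may be a false dependence,
    # since the upper bits have not yet been compared.
    LSB_BITS = 8
    LSB_MASK = (1 << LSB_BITS) - 1

    def partial_conflict(load_addr_lsb: int, store_addr_lsb: int) -> bool:
        """True if the load and store might alias, judging by LSBs alone."""
        return (load_addr_lsb & LSB_MASK) == (store_addr_lsb & LSB_MASK)

    def may_issue_early(load_addr_lsb: int, older_store_lsbs: list[int]) -> bool:
        """A load may proceed as soon as no earlier store shares its LSBs."""
        return not any(partial_conflict(load_addr_lsb, s) for s in older_store_lsbs)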


    L wires: Narrow bit width operands

    • Transfer of 10-bit integer results on L wires

      • Enables earlier scheduling and wake-up of dependent operations

      • Reduces the branch mispredict penalty

      • A predictor table of 8K two-bit counters (a predictor sketch follows this list)

        • Identifies 95% of all narrow bit-width results

        • Accuracy of 98%

    • Implemented in the PowerPC!
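
    A minimal sketch of the narrow bit-width predictor described above: a PC-indexed table of two-bit saturating counters (8K entries per the slide). The indexing scheme and the placement of the 10-bit threshold are assumptions for illustration.

    # PC-indexed table of 2-bit saturating counters predicting whether an
    # instruction's result fits in a narrow (10-bit) operand.
    TABLE_SIZE = 8 * 1024      # 8K counters, per the slide
    NARROW_BITS = 10           # results that fit ride the L wires

    class NarrowWidthPredictor:
        def __init__(self) -> None:
            self.counters = [0] * TABLE_SIZE   # each counter ranges 0..3

        def _index(self, pc: int) -> int:
            return pc % TABLE_SIZE             # simple direct-mapped index

        def predict_narrow(self, pc: int) -> bool:
            return self.counters[self._index(pc)] >= 2

        def update(self, pc: int, result: int) -> None:
            # Treat results representable as signed 10-bit values as narrow.
            narrow = -(1 << (NARROW_BITS - 1)) <= result < (1 << (NARROW_BITS - 1))
            i = self._index(pc)
            if narrow:
                self.counters[i] = min(3, self.counters[i] + 1)
            else:
                self.counters[i] = max(0, self.counters[i] - 1)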


    PW wires: Power/Bandwidth efficient

    • Idea: steer non-critical data through the energy-efficient PW interconnect

    • Transfer of data at instruction dispatch

      • Transfer of input operands to the remote register file

      • Hidden by the long dispatch-to-issue latency

    • Store data (rarely on the critical path)


    Evaluation methodology

    • A dynamically scheduled clustered processor with 4 clusters, modeled in Simplescalar-3.0

    • Crossbar interconnects

    • Centralized front-end

      • I-Cache & D-Cache

      • LSQ

      • Branch Predictor

    [Figure: four clusters connected to the centralized front-end and L1 D-cache by a crossbar.]

    • Crossbar latencies

      • B wires (2 cycles)

      • L wires (1 cycle)

      • PW wires (3 cycles)


    Evaluation methodology

    • A dynamically scheduled 16-cluster processor modeled in Simplescalar-3.0

    • Ring latencies

      • B wires (4 cycles)

      • PW wires (6 cycles)

      • L wires (2 cycles)

    [Figure: sixteen clusters connected by a ring interconnect, with a crossbar to the centralized I-cache, D-cache, and LSQ.]


    IPC improvements: L wires

    L wires improve performance by 4% on the four-cluster system and by 7.1% on the sixteen-cluster system


    Four-cluster system: ED2 gains


    Sixteen-cluster system: ED2 gains
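
    ED2 on these result slides denotes the energy-delay-squared product, a standard (largely voltage-independent) efficiency metric; this definition is common usage rather than something spelled out on the slides. For total energy $E$ and execution time $D$,

    $$ ED^2 = E \cdot D^2 $$

    so lower is better, and power savings count only if they do not come at a large performance cost.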


    Conclusions

    • Exposing the wire design space to the architecture

    • A case for micro-architectural wire management!

    • A low-latency, low-bandwidth network alone improves performance by up to 7%

    • ED2 improvements of about 11% compared to a baseline processor with a homogeneous interconnect

    • The approach entails additional hardware complexity


    Future work

    • A preliminary evaluation looks promising

    • Heterogeneous interconnect entails complexity

    • Design of heterogeneous clusters

    • Energy efficient interconnect


    Questions and Comments?

    Thank you!


    Backup


    L wires: Accelerating cache access

    • TLB access for page look up

      • Transmit a few bits of the virtual page number on L wires

      • Prefetch data out of the L1 cache and TLB

      • 18 L wires (6 tag bits, 8 L1 index bits, and 4 TLB index bits); see the sketch below
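
    A hypothetical split of those 18 early-transfer bits out of a virtual address; the slide only gives the field widths, so the bit positions, line-offset width, and page size below are assumptions for illustration.

    # Extract the fields sent ahead on L wires: 6 partial-tag bits,
    # 8 L1 index bits, and 4 TLB index bits (positions are assumed).
    OFFSET_BITS = 5            # assumed L1 line-offset width
    L1_INDEX_BITS = 8
    PARTIAL_TAG_BITS = 6
    TLB_INDEX_BITS = 4
    PAGE_OFFSET_BITS = 12      # 4 KB pages assumed

    def early_bits(vaddr: int) -> tuple[int, int, int]:
        """Return (partial_tag, l1_index, tlb_index) for the early L-wire transfer."""
        l1_index = (vaddr >> OFFSET_BITS) & ((1 << L1_INDEX_BITS) - 1)
        partial_tag = (vaddr >> (OFFSET_BITS + L1_INDEX_BITS)) & ((1 << PARTIAL_TAG_BITS) - 1)
        tlb_index = (vaddr >> PAGE_OFFSET_BITS) & ((1 << TLB_INDEX_BITS) - 1)
        return partial_tag, l1_index, tlb_index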


    Model parameters

    • Simplescalar-3.0 with separate integer and floating point queues

    • 32 KB 2-way instruction cache

    • 32 KB 4-way data cache

    • 128-entry, 8-way instruction and data TLBs


    Overview/Motivation:

    • Three wire implementations employed in this study

      • B wires: traditional

        • Optimal delay

        • High power consumption

      • L wires:

        • Faster than B wires

        • Lower bandwidth

      • PW wires:

        • Reduced power consumption

        • Higher bandwidth compared to B wires

        • Increased delay through the wires

