PBExplore: A Framework for CIL Exploration of Partial Bypasses in Embedded Processors

Aviral Shrivastava¹, Nikil Dutt¹, Alex Nicolau¹, Eugene Earlie²

¹ Center for Embedded Computer Systems, University of California, Irvine, CA, USA

² Strategic CAD Labs, Intel, Hudson, MA, USA



Bypassing Improves Performance

  • Pipelining improves performance

    • Limited by pipeline hazards

  • Bypasses eliminate certain data hazards

    • Further improve performance

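[Figure: pipeline F, D, OR, X1, X2, WB with register file RF, executing R1 ← R2 + R3 followed by the dependent R4 ← R4 + R1; the value of R1 is passed between the two operations.]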
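The benefit of a bypass for a dependent pair such as R1 ← R2 + R3 followed by R4 ← R4 + R1 can be illustrated with a small timing sketch. The stage names are taken from the slide; the timing rule (a result becomes readable one cycle after the stage that writes it, operands are read in OR) and the function name `stall_cycles` are assumptions made for illustration, not the model used by the authors.

```python
# Illustrative timing model: stage names come from the slide, the timing rule is assumed.
STAGES = ["F", "D", "OR", "X1", "X2", "WB"]

def stall_cycles(write_stage, read_stage="OR", issue_gap=1, visible_after=1):
    """Stalls seen by a consumer issued `issue_gap` cycles after its producer.

    The producer occupies STAGES[i] in cycle i. Its result is assumed to become
    readable `visible_after` cycles after it passes through `write_stage`.
    """
    first_readable = STAGES.index(write_stage) + visible_after
    wants_to_read = issue_gap + STAGES.index(read_stage)
    return max(0, first_readable - wants_to_read)

no_bypass = stall_cycles("WB")    # result reaches OR only through the register file
with_bypass = stall_cycles("X1")  # result forwarded from X1 over a bypass
print(no_bypass, with_bypass)
```

Under these assumptions the back-to-back pair stalls for three cycles without a bypass and for one cycle with a bypass from X1; a different forwarding latency can be tried through `visible_after`.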


Impact of Bypassing

  • Wiring congestion

  • Cycle time

    • Bypasses may be part of the timing-critical path

  • Overall chip complexity

    • deeply pipelined

    • out-of-order processors

  • Area and Power consumption

    • Wide multiplexers

    • Bypass Control logic

    • Bypass wires


P. Ahuja et al., "The Performance Impact of Incomplete Bypassing in Processor Pipelines," MICRO 1995.

A. Abnous and N. Bagherzadeh, "Pipelining and Bypassing in a VLIW Processor," IEEE Trans... 1995.


Problem, Solution and Problem

  • Problem – How do I customize bypasses?

    • Important for Embedded Systems

  • Solution –

    • Keep only the most beneficial bypasses

    • Area, Power and Performance trade-off


  • Problems –

    • How to compile for a processor with partial bypassing?

    • Requires Compiler-in-the-Loop Exploration


Related Work

  • Optimizations for partial bypassing

    • P. Ahuja et al. [MICRO’95]

      • Manual code generation

    • M. Buss et al. [CASES’01]

      • Optimize inter-cluster copy operations

    • K. Fan et al. [ASAP’03]

      • FU-allocation strategy

        Only for VLIW processors

    • A. Shrivastava et al. [CODES’04]

      • A generic “pipeline hazard detection” mechanism to generate bypass-sensitive code

We present

  • A generic Compiler-in-the-Loop bypass exploration framework

  • An area-power-performance trade-off study on Intel XScale, performed by varying bypasses


PBExplore: A CIL Exploration Framework

[Figure: PBExplore flow. Inputs: Application, Bypass Configuration. Blocks: Bypass-sensitive Compiler, Cycle-accurate Simulator, Synthesis Tool, Power Simulator. Intermediate: Executable, Bypass-control Logic, Stimulus. Outputs: Execution Cycles, Area Estimate, Energy Estimate, Report.]
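The exploration loop suggested by the framework slide can be sketched as follows. This is only an outline: the four callables (`compile_app`, `simulate`, `synthesize`, `estimate_energy`) are hypothetical stand-ins for the bypass-sensitive compiler, cycle-accurate simulator, synthesis tool and power simulator, and the way data flows between them is an assumption read off the block names, not a documented interface.

```python
from typing import NamedTuple

class Report(NamedTuple):
    config: int      # bypass configuration identifier
    cycles: int      # execution cycles from the cycle-accurate simulator
    area: float      # area estimate of the bypass-control logic
    energy: float    # energy estimate of the bypass-control logic

def explore(configs, compile_app, simulate, synthesize, estimate_energy):
    """Evaluate every bypass configuration with the compiler in the loop.

    All four callables are hypothetical placeholders supplied by the caller.
    """
    reports = []
    for config in configs:
        executable = compile_app(config)                # code scheduled for this configuration
        cycles, stimulus = simulate(executable, config)  # performance plus switching stimulus
        logic, area = synthesize(config)                 # bypass-control logic and its area
        energy = estimate_energy(logic, stimulus)        # power simulation on that logic
        reports.append(Report(config, cycles, area, energy))
    return reports
```

The point of the structure is that the application is recompiled for each configuration before it is simulated, so the reported cycles reflect code that was scheduled with that configuration in mind.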


Bypass Sensitive Scheduling

  • Bypasses transfer data between dependent operations

  • Missing bypasses cause pipeline hazards

[Figure: pipeline F, D, OR, X1, X2, WB with register file RF, shown for two cases labeled "No Hazard" and "Hazard": R1 ← R2 + R3 followed by R4 ← R4 + R1.]

  • Bypass-sensitive compiler should be able to

    • detect and avoid pipeline hazards


Operation Table

[Figure: pipeline F, D, OR, X1, X2, WB with RF, BRF and connections C1 to C5.]

Operation Table for ADD R1 R2 R3

  1. F

  2. D

  3. OR

       ReadOperands

         R2: C1 RF

         R3: C2 RF, C5 BRF

       DestOperands

         R1: RF

  4. X1

       WriteOperands

         R1: C4 BRF

  5. X2

  6. XWB

       WriteOperands

         R1: C3 RF

Details are in the paper.

  • Operation Table is a binding between

    • Operation and Processor Resources and Registers

  • Can detect Resource Hazards

    • OTs model processor resources

  • Can detect Data Hazards

    • OTs model processor registers
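A minimal sketch of how an Operation Table could drive hazard detection is given below. The table for ADD is transcribed from the slide; everything else (the `OTCycle` and `Machine` names, in-order issue, and the rule that a value written in cycle c is readable from cycle c + 1) is an assumption made for illustration and may differ from the mechanism described in the paper.

```python
from dataclasses import dataclass, field
from math import inf

@dataclass
class OTCycle:
    stage: str                                   # pipeline resource used in this cycle
    reads: dict = field(default_factory=dict)    # reg -> [(connection, file), ...] alternatives
    dests: dict = field(default_factory=dict)    # reg -> file; old copies become stale
    writes: dict = field(default_factory=dict)   # reg -> (connection, file)

def add_ot(dst, src1, src2):
    """Operation Table for ADD dst src1 src2, following the slide."""
    return [
        OTCycle("F"),
        OTCycle("D"),
        OTCycle("OR",
                reads={src1: [("C1", "RF")],
                       src2: [("C2", "RF"), ("C5", "BRF")]},
                dests={dst: "RF"}),
        OTCycle("X1", writes={dst: ("C4", "BRF")}),
        OTCycle("X2"),
        OTCycle("XWB", writes={dst: ("C3", "RF")}),
    ]

class Machine:
    """Tracks busy resources and register availability (assumed in-order issue)."""

    def __init__(self):
        self.busy = set()    # (cycle, stage) pairs already occupied
        self.ready = {}      # (file, reg) -> first cycle the value is readable

    def _readable(self, file, reg, cycle):
        default = 0 if file == "RF" else inf     # RF starts valid, BRF starts empty
        return self.ready.get((file, reg), default) <= cycle

    def hazards(self, ot, start):
        found = []
        for offset, row in enumerate(ot):
            cycle = start + offset
            if (cycle, row.stage) in self.busy:
                found.append(("resource", cycle, row.stage))
            for reg, paths in row.reads.items():
                if not any(self._readable(f, reg, cycle) for _conn, f in paths):
                    found.append(("data", cycle, reg))
        return found

    def issue(self, ot, start):
        for offset, row in enumerate(ot):
            cycle = start + offset
            self.busy.add((cycle, row.stage))
            for reg in row.dests:                # pending write: invalidate every copy
                self.ready[("RF", reg)] = inf
                self.ready[("BRF", reg)] = inf
            for reg, (_conn, f) in row.writes.items():
                self.ready[(f, reg)] = cycle + 1  # assumed: readable one cycle later

    def earliest_issue(self, ot, not_before=0):
        start = not_before
        while self.hazards(ot, start):
            start += 1
        return start

m = Machine()
m.issue(add_ot("R1", "R2", "R3"), 0)
print(m.earliest_issue(add_ot("R4", "R4", "R1"), 1))  # R1 as second operand: BRF path exists
print(m.earliest_issue(add_ot("R4", "R1", "R4"), 1))  # R1 as first operand: RF only
```

Because only the second source operand has a path from BRF in this table, the same dependence costs a different number of cycles depending on which operand position carries R1, which is the kind of effect a bypass-sensitive scheduler has to see.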


Experiments

  • Experiments I – Need for a CIL framework

    • Need for Bypass-sensitive Compiler-in-the-Loop Exploration

    • Traditional exploration versus Bypass-sensitive Compiler-in-the-Loop exploration

  • Experiments II – CIL Exploration

    • Use of Bypass-sensitive Compiler-in-the-Loop Exploration

    • Perform Power-Performance-Area trade-offs

    • Identify alternate interesting design points


Experiments I - Framework

Traditional Exploration versus Bypass-sensitive Compiler-in-the-Loop Exploration

[Figure: two flows for the same Application and Bypass Configuration. Traditional Exploration: gcc -O3, Executable, Cycle Accurate Simulator, Traditional Cycles. Bypass-sensitive Compiler-in-the-Loop Exploration: OT-based Compiler, Executable, Cycle Accurate Simulator, CIL Cycles.]


Experiments I - Setup

  • Seven pipeline stages can bypass a result

  • We vary which pipeline stages bypass a result

    • 2^7 = 128 bypass configurations

    • Encode bypass configuration

      • <DWB D2 MWB M2 XWB X2 X1>

    • Configuration 28 = <0011100>

      • Bypass paths from MWB, M2 and XWB are present

[Figure: XScale pipeline stages F1, F2, ID, RF, followed by X1, X2, XWB; M1, M2, MWB; and D1, D2, DWB.]
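The configuration encoding on this slide can be checked with a few lines of code. The bit order is the one given on the slide, most significant bit first; the helper names `decode` and `encode` are made up for this sketch.

```python
# Bypass sources in the order given on the slide: <DWB D2 MWB M2 XWB X2 X1>
SOURCES = ["DWB", "D2", "MWB", "M2", "XWB", "X2", "X1"]

def decode(config):
    """Return the bypass sources present in a configuration number."""
    if not 0 <= config < 2 ** len(SOURCES):
        raise ValueError(f"configuration must be in 0..{2 ** len(SOURCES) - 1}")
    bits = f"{config:0{len(SOURCES)}b}"          # e.g. 28 -> "0011100"
    return [name for name, bit in zip(SOURCES, bits) if bit == "1"]

def encode(present):
    """Return the configuration number for a collection of bypass sources."""
    width = len(SOURCES)
    return sum(1 << (width - 1 - SOURCES.index(name)) for name in set(present))

print(2 ** len(SOURCES))   # number of configurations
print(decode(28))          # sources present in configuration 28
```

Decoding 28 yields MWB, M2 and XWB, matching the example on the slide, and `encode` is the inverse mapping.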


Bypass Explorations on XScale

[Figure: Execution Cycles (y-axis, 850000 to 1250000) versus Bypass Source Configurations (x-axis, 0 to 128) for bitcount; series: Traditional, CIL.]

  • CIL-compiler can effectively exploit the bypass configuration

  • Significant performance difference


X-bypass explorations in XScale

[Figure: Execution Cycles (y-axis, 850000 to 1200000) for bitcount versus X-bypass Configuration (-, X1, X2, XWB, X2 X1, XWB X2, XWB X1, XWB X2 X1); series: Traditional, CIL. Inset: XScale pipeline.]

Difference in trends



M-bypass explorations in XScale

Difference in trends


D-bypass exploration in XScale

[Figure: Execution Cycles (y-axis, 860000 to 980000) for bitcount versus D Bypass Configurations (-, DWB, D2, DWB D2); series: Traditional, CIL. Inset: XScale pipeline.]

Difference in trends


Experiments II - Setup

Power-Performance-Area trade-offs

  • Scheduler

    • Exhaustive instruction reordering within Basic Blocks

  • Synthesis Tool

    • Synopsys Design Compiler 2001.10

    • 0.8µ library lsi_10k

  • Power Estimation

    • Synopsys power_estimate
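The "exhaustive instruction reordering within Basic Blocks" used by the scheduler can be sketched as a search over all dependence-respecting orders. The cost function below is a deliberately simple stand-in (single issue, fixed producer-to-consumer latency), not the Operation Table based model; the instruction encoding and all function names are assumptions for illustration. The search is factorial in the block size, so it is only practical for short blocks.

```python
from itertools import permutations

# An instruction is (destination register, tuple of source registers).

def depends(earlier, later):
    """True if `later` must stay after `earlier` (RAW, WAR or WAW dependence)."""
    d1, s1 = earlier
    d2, s2 = later
    return d1 in s2 or d2 in s1 or d1 == d2

def legal(order, block):
    """Check that every dependence of the original block is preserved."""
    pos = {idx: p for p, idx in enumerate(order)}
    return all(pos[i] < pos[j]
               for i in range(len(block))
               for j in range(i + 1, len(block))
               if depends(block[i], block[j]))

def cycles(order, block, latency):
    """Issue cycles under an assumed model: one issue per cycle, and a result
    is usable `latency` cycles after its producer issues."""
    ready, next_slot = {}, 0
    for idx in order:
        dst, srcs = block[idx]
        issue = max([next_slot] + [ready.get(s, 0) for s in srcs])
        ready[dst] = issue + latency
        next_slot = issue + 1
    return next_slot

def best_schedule(block, latency):
    """Try every legal permutation and keep the cheapest one."""
    candidates = (o for o in permutations(range(len(block))) if legal(o, block))
    return min(candidates, key=lambda o: cycles(o, block, latency))

block = [
    ("R1", ("R2", "R3")),   # R1 <- R2 + R3
    ("R4", ("R4", "R1")),   # R4 <- R4 + R1 (depends on the first instruction)
    ("R5", ("R6", "R7")),
    ("R8", ("R6", "R2")),
]
order = best_schedule(block, latency=3)
print(order, cycles(order, block, 3), cycles(range(len(block)), block, 3))
```

With a latency of three cycles, the original order stalls the dependent instruction, while the best order moves the two independent instructions into the gap. In a bypass-sensitive setting the fixed `latency` would be replaced by a cost derived from the bypass configuration.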

[Figure: PBExplore flow for this setup: Bypass Configuration, Application, Bypass-sensitive Compiler, Executable, Cycle-accurate Simulator, Bypass Control Logic, Synthesis Tool, Power Simulator, Report.]

Intel XScale Microarchitecture Programmers Reference Manual, http://www.developer.intel.com

M. R. Guthaus et al., MiBench: A free commercially representative…, IEEE Workshop… 2001

Synopsys Design Compiler, 2001, http://www.synopsys.com/products/logic/design_compiler.html



Performance-Energy-Area Trade-off

  • Design Point 1

    • no bypass from MWB and XWB to first operand

    • 18% less area and 14% less energy consumption of bypass control logic

    • 2% performance loss

  • Design Point 2

    • Only D2 and X2 bypass to first operand

    • 25% less area and 16% less energy consumption of bypass control logic

    • 6% performance loss
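The two design points above can be compared programmatically. The sketch below uses only the percentages quoted on the slide and adds a baseline with no savings and no loss, on the assumption that the percentages are measured against the fully bypassed processor; the names `DesignPoint`, `pareto_front` and `best_within` are made up for illustration.

```python
from typing import NamedTuple

class DesignPoint(NamedTuple):
    name: str
    area_saving: float     # % less area of bypass control logic
    energy_saving: float   # % less energy of bypass control logic
    perf_loss: float       # % performance loss

POINTS = [
    DesignPoint("Baseline", 0, 0, 0),            # assumed reference: full bypassing
    DesignPoint("Design Point 1", 18, 14, 2),
    DesignPoint("Design Point 2", 25, 16, 6),
]

def dominates(a, b):
    """a is at least as good as b on every metric and strictly better on one."""
    no_worse = (a.area_saving >= b.area_saving
                and a.energy_saving >= b.energy_saving
                and a.perf_loss <= b.perf_loss)
    better = (a.area_saving > b.area_saving
              or a.energy_saving > b.energy_saving
              or a.perf_loss < b.perf_loss)
    return no_worse and better

def pareto_front(points):
    """Keep the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

def best_within(points, max_perf_loss):
    """Largest area (then energy) saving among points within a loss budget."""
    allowed = [p for p in points if p.perf_loss <= max_perf_loss]
    return max(allowed, key=lambda p: (p.area_saving, p.energy_saving))

print([p.name for p in pareto_front(POINTS)])
print(best_within(POINTS, max_perf_loss=3).name)
```

None of the three points dominates another, so each is a legitimate trade-off; which one is chosen depends on how much performance loss the design can tolerate.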


Summary

  • Bypassing improves performance but is costly in terms of area and power

  • Partial bypassing presents valuable trade-offs but poses challenges for compilation

  • We presented PBExplore – A Compiler-in-the-Loop Exploration framework to explore partial bypasses.

    • PBExplore uses Operation Tables to generate bypass-sensitive code

    • PBExplore automatically synthesizes bypass control logic to explore power and area trade-offs

  • PBExplore is able to discover interesting design points that trade off performance for power and area of bypass control logic


Thank You


Pipeline Hazard Detection using OT

[Figure: pipeline F, D, OR, X1, X2, WB with RF, BRF and connections C1 to C5.]


Resource Hazard Detection

[Figure: the same pipeline with a Resource Hazard marked.]


Data Hazard Detection

[Figure: the same pipeline with a Data Hazard marked.]
