BulletProof: A Defect-Tolerant CMP Switch Architecture


Kypros Constantinides ‡ Stephen Plaza ‡ Jason Blome ‡ Bin Zhang † Valeria Bertacco ‡ Scott Mahlke ‡ Todd Austin ‡ Michael Orshansky † ‡ Advanced Computer Architecture Lab † Department of Electrical and Computer Engineering


Kypros Constantinides‡, Stephen Plaza‡, Jason Blome‡, Bin Zhang†, Valeria Bertacco‡, Scott Mahlke‡, Todd Austin‡, Michael Orshansky†

‡Advanced Computer Architecture Lab, University of Michigan
†Department of Electrical and Computer Engineering, University of Texas at Austin

Introduction
  • Reliability is a critical aspect of any computer design
  • System designers target very small failure rates
  • Today, reliability targets are met by using fault-avoidance design techniques
    • use of conservative design margins
  • For future process technologies, it will be impossible to avoid system failures by using conservative design margins
    • need defect-tolerant design techniques

[Figure: transistor reliability versus transistor lifetime (years), comparing "Now" with "Future"]

Reliable System Design Space
  • Need for cost- and performance-efficient techniques that can provide high reliability in the presence of unreliable components – “BulletProof”

[Figure: design-space chart with axes "Type of Defect" and "Design Feature". Techniques shown: DMR, TMR, ECC, ECC (memory), Diva, Razor, cache-line swap-out, memory-array spares, and BulletProof. Categories: mainstream, high-end, specialized, and research-stage solutions]

CMP Switch Architecture
  • Goal: A defect-tolerant CMP switch design
  • Baseline switch architecture is provided by Li-Shiuan Peh
  • Implements the routing and flow-control functions required for transmitting packets in a 2D torus network
  • Wormhole switch pipelined at the flit level (32-bit flits)
  • Dimension-order routing
  • Specified in Verilog and synthesized to a gate-level netlist: ~9K logic gates and 1,700 sequential elements
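As a rough illustration of dimension-order routing on a 2D torus, the sketch below resolves the X offset fully before the Y offset and takes the shorter wraparound direction in each dimension. The function names, the X-before-Y order, and the tie-breaking rule are assumptions made for this example; the slides do not specify them.

```python
def torus_step(cur: int, dst: int, size: int) -> int:
    """Return -1, 0 or +1: the hop direction along one torus dimension."""
    forward = (dst - cur) % size          # hops needed going in the + direction
    if forward == 0:
        return 0
    # Assumption: ties between the two directions are broken toward +1.
    return 1 if forward <= size - forward else -1


def dimension_order_route(src, dst, width, height):
    """List the switches visited from src to dst, X dimension first, then Y."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:
        x = (x + torus_step(x, dst[0], width)) % width
        path.append((x, y))
    while y != dst[1]:
        y = (y + torus_step(y, dst[1], height)) % height
        path.append((x, y))
    return path


if __name__ == "__main__":
    # On a 4x4 torus, going from x=0 to x=3 is one hop through the wraparound link.
    print(dimension_order_route((0, 0), (3, 1), 4, 4))
```

Because every packet finishes one dimension before starting the next, the route depends only on the source and destination coordinates.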

Soft Errors (SEU) Vulnerability
  • In earlier work we studied the vulnerability of the switch architecture to soft errors
    • Only 3.2% of faults eventually cause an error
  • Age-related wear-out silicon defects are a more challenging reliability threat for future technologies
  • In this work we focus on solutions for in-field silicon defects
  • These solutions also provide soft-error tolerance to the design
Self-Repairing Systems
  • Defect-tolerant self-repairing systems need to support:
    • Error Detection
    • System Diagnosis (locate the origin of the error)
    • System Repair
    • System Recovery
  • Key idea:
    • error detection must be performance efficient
      • continuously check execution for errors
    • diagnosis, repair and recovery are not performance-critical
      • they are invoked only when an error is detected (rare scenario)
      • trade performance for more cost-efficient techniques
Traditional Defect-Tolerant Techniques

[Figure: TMR block diagram with three modules (M) and a voter (V); ECC codeword layout with ECC bits R1 to R4 interleaved with data bits D1 to D8 in the order R1 R2 D1 R3 D2 D3 D4 R4 D5 D6 D7 D8]
  • Traditional techniques for designing defect-tolerant systems:
    • Triple Modular Redundancy (TMR)
      • Forward recovery
      • Applicable to both combinational and sequential logic
      • Cannot tolerate more than one defective module
      • Area and power overhead ~3X
    • Error Correction Codes (ECC)
      • Lower-overhead solution
      • Applicable only to state-holding structures and buses
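The two traditional techniques can be sketched in a few lines. The voter below is a bitwise majority over three module outputs. The ECC part assumes a single-error-correcting Hamming code with check bits at positions 1, 2, 4 and 8, which is consistent with the R1 R2 D1 R3 D2 D3 D4 R4 D5 D6 D7 D8 layout on the slide. The slides do not name the code, so treat that as an assumption, along with all function names.

```python
def tmr_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority of three module outputs (masks one faulty module)."""
    return (a & b) | (a & c) | (b & c)


def hamming_encode(data_bits):
    """Encode 8 data bits into a 12-bit codeword, parity at positions 1, 2, 4, 8."""
    code = [0] * 13                                   # index 0 unused, positions 1..12
    data_positions = [p for p in range(1, 13) if p & (p - 1)]  # not a power of two
    for pos, bit in zip(data_positions, data_bits):
        code[pos] = bit
    for k in (1, 2, 4, 8):
        parity = 0
        for pos in range(1, 13):
            if pos & k and pos != k:
                parity ^= code[pos]
        code[k] = parity
    return code[1:]


def hamming_correct(codeword):
    """Return (corrected codeword, syndrome); syndrome 0 means no error seen."""
    syndrome = 0
    for pos, bit in enumerate(codeword, start=1):
        if bit:
            syndrome ^= pos
    fixed = list(codeword)
    if 0 < syndrome <= len(fixed):
        fixed[syndrome - 1] ^= 1                      # flip the single faulty bit
    return fixed, syndrome


if __name__ == "__main__":
    print(bin(tmr_vote(0b1010, 0b1010, 0b0110)))      # faulty third module is outvoted
    word = hamming_encode([1, 0, 1, 1, 0, 0, 1, 0])
    damaged = list(word)
    damaged[5] ^= 1                                   # corrupt position 6
    print(hamming_correct(damaged)[0] == word)
```

The voter handles arbitrary logic at roughly three times the area, while the code protects only stored or transmitted bits, which matches the trade-off listed above.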

Error Detection: Low-Cost Domain-Specific Technique
  • The synthesized netlist of the added components accounts for ~10% of the total switch area
  • Provides error detection for both hard and soft errors

[Figure: switch block diagram with labels: flit, input buffers, routing logic, header, ARB, cross-bar controller, cross-bar, buffer checker, CRC checkers, error]
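A minimal sketch of CRC-based checking of a 32-bit flit follows. The CRC width and polynomial (CRC-8 with polynomial 0x07) and the function names are assumptions chosen for illustration; the slides do not state which CRC the switch uses.

```python
def crc8(data: bytes, poly: int = 0x07) -> int:
    """Bitwise CRC-8 (initial value 0, no reflection) over a byte string."""
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 0x80:
                crc = ((crc << 1) ^ poly) & 0xFF
            else:
                crc = (crc << 1) & 0xFF
    return crc


def attach_crc(flit: int):
    """Pair a 32-bit flit with the CRC computed when it enters the switch."""
    return flit, crc8(flit.to_bytes(4, "big"))


def check_flit(flit: int, crc: int) -> bool:
    """Recompute the CRC at the checker; False signals a detected error."""
    return crc8(flit.to_bytes(4, "big")) == crc


if __name__ == "__main__":
    flit, crc = attach_crc(0xDEADBEEF)
    print(check_flit(flit, crc))              # intact flit passes
    print(check_flit(flit ^ (1 << 7), crc))   # single flipped bit is caught
```

The check does not care whether the corruption came from a permanent defect or a transient upset, which is why one mechanism covers both hard and soft errors.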

Adding Defect Resiliency With Lower Cost

[Figure: example netlist with nodes A through J, each label appearing twice]
  • Automatic Cluster Decomposition
  • Balanced recursive min-cut heuristic algorithm
    • Input: (a) the design's gate-level netlist, (b) the number of partitions
    • Output: a partitioned netlist
    • Goals:
      • Balance partition sizes: smaller partitions give higher resilience
      • Minimize cut edges: reduces cost overhead and vulnerable logic
  • Partitions can contain both combinational and sequential logic
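The decomposition step can be approximated with a small recursive bisection. This is a simplified stand-in and not the authors' heuristic: each bisection starts from an even split and greedily swaps node pairs while that lowers the number of cut edges, so it can stop in a local minimum. The netlist is modelled as a plain undirected graph, the function names are hypothetical, and the number of partitions is assumed not to exceed the number of nodes.

```python
from itertools import product


def cut_size(edges, left):
    """Number of edges with exactly one endpoint in `left`."""
    left = set(left)
    return sum((u in left) != (v in left) for u, v in edges)


def bisect(nodes, edges, left_size):
    """Split nodes into two groups of fixed sizes, greedily reducing the cut."""
    left, right = set(nodes[:left_size]), set(nodes[left_size:])
    improved = True
    while improved:
        improved = False
        best = cut_size(edges, left)
        for a, b in product(sorted(left), sorted(right)):
            trial = (left - {a}) | {b}
            if cut_size(edges, trial) < best:     # keep any swap that helps
                left, right = trial, (right - {b}) | {a}
                improved = True
                break
    return sorted(left), sorted(right)


def partition(nodes, edges, k):
    """Recursively bisect until k size-balanced partitions remain."""
    nodes = list(nodes)
    if k == 1 or len(nodes) <= 1:
        return [nodes]
    node_set = set(nodes)
    inner = [(u, v) for u, v in edges if u in node_set and v in node_set]
    k_left = k // 2
    left_size = len(nodes) * k_left // k          # keep sizes proportional
    left, right = bisect(nodes, inner, left_size)
    return partition(left, inner, k_left) + partition(right, inner, k - k_left)


if __name__ == "__main__":
    # Two triangles joined by the single edge C-D.
    gates = ["A", "D", "B", "E", "C", "F"]
    wires = [("A", "B"), ("B", "C"), ("A", "C"),
             ("D", "E"), ("E", "F"), ("D", "F"), ("C", "D")]
    print(partition(gates, wires, 2))
```

In the two-triangle example the only edge left crossing the partition boundary is C-D, which illustrates the goal of keeping inter-partition wiring small.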

Partition Sparing – Silicon Protection Factor

[Figure: example partitioned netlist with nodes A through J, each label appearing twice. Annotation: SPF (defect tolerance), 7.6X more defects tolerated per unit area]
  • Partition sparing:
    • Only one spare is active for each partition of the switch
    • Replace voting logic with spare swapping logic
    • Lower power overhead
    • A defect is fatal if it hits the last spare of a partition or the spare swapping logic

Silicon Protection Factor (SPF) = Mean Defects to Failure / Area Overhead

    • The number of defects in a design is proportional to the design's area
    • Enables comparison of different defect-tolerant designs

[Figure annotations: 1 extra spare per partition; 15.8X more defects tolerated]
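The SPF ratio can be explored with a small Monte Carlo sketch. The model is deliberately crude and rests on assumptions not taken from the slides: every copy of every partition has equal area, defects land uniformly at random, the spare swapping logic is never hit, and the area overhead passed to the ratio is supplied by the caller. All names are hypothetical.

```python
import random


def defects_to_failure(partitions: int, spares: int, rng: random.Random) -> int:
    """Count defects injected until some partition has no working copy left."""
    copies = spares + 1
    hit = [[False] * copies for _ in range(partitions)]
    count = 0
    while True:
        count += 1
        p = rng.randrange(partitions)     # which partition the defect lands in
        c = rng.randrange(copies)         # which copy of that partition
        hit[p][c] = True
        if all(hit[p]):                   # last working copy is gone: fatal
            return count


def mean_defects_to_failure(partitions, spares, trials=2000, seed=0):
    """Average defects_to_failure over many random trials."""
    rng = random.Random(seed)
    total = sum(defects_to_failure(partitions, spares, rng) for _ in range(trials))
    return total / trials


def silicon_protection_factor(mean_defects: float, area_overhead: float) -> float:
    """SPF = mean defects to failure divided by area overhead."""
    return mean_defects / area_overhead


if __name__ == "__main__":
    mdf = mean_defects_to_failure(partitions=16, spares=1)
    # Assumed overhead of 2.0 for one spare per partition, ignoring extra logic.
    print(mdf, silicon_protection_factor(mdf, 2.0))
```

An unprotected single-partition design fails on its first defect in this model, so its mean is exactly 1. Adding spares raises the mean, and the ratio shows whether the extra area was worth it.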

System Recovery

  • Add a Recovery Pointer to each input buffer
  • Recovery pointers advance 4 cycles after the input controller grants the requested output channel
    • Guarantees that the flit is CRC checked
  • On error detection:
    • All CRC checkers drop outgoing flits
    • Switch pipeline is flushed
    • Head pointers are set to recovery pointers
    • Restart execution

Figure legend: a: correctly routed flit; b, c: in the switch pipeline; d: next flit to be routed; e: last flit buffered

[Figure: input buffers holding flits a through e, with Tail, Head, and Recovery Head pointers; other labels: switch, interconnect, routed flit, CRC checker, error detection signal, recovery logic]
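The recovery-pointer mechanism can be modelled with a toy input buffer. The class below is an illustrative sketch: the class name, method names, and the cycle bookkeeping are assumptions, and only the 4-cycle delay and the "head := recovery" rollback come from the slide.

```python
class InputBuffer:
    """Toy model of an input buffer with head and recovery pointers."""

    CHECK_DELAY = 4                       # cycles from grant to completed CRC check

    def __init__(self, flits):
        self.flits = list(flits)
        self.head = 0                     # next flit to be routed
        self.recovery = 0                 # oldest flit not yet known to be correct
        self.cycle = 0
        self.grant_cycles = []            # grant time of each in-flight flit

    def grant(self):
        """Send the flit at the head pointer into the switch pipeline."""
        flit = self.flits[self.head]
        self.grant_cycles.append(self.cycle)
        self.head += 1
        return flit

    def tick(self):
        """Advance one cycle; retire flits granted CHECK_DELAY cycles ago."""
        self.cycle += 1
        while self.grant_cycles and self.cycle - self.grant_cycles[0] >= self.CHECK_DELAY:
            self.grant_cycles.pop(0)
            self.recovery += 1

    def on_error(self):
        """Flush in-flight flits and replay from the recovery pointer."""
        self.grant_cycles.clear()
        self.head = self.recovery


if __name__ == "__main__":
    buf = InputBuffer("abcde")
    for _ in range(3):                    # grant a, b, c on consecutive cycles
        buf.grant()
        buf.tick()
    buf.tick()                            # cycle 4: flit a is now CRC checked
    print(buf.recovery, buf.head)         # a retired; b, c in flight; d is next
    buf.on_error()
    print(buf.grant())                    # replay restarts at flit b
```

The state before the error mirrors the figure legend: a is safely routed, b and c are still in the pipeline, and d is next to be routed.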

System Diagnosis and Repair
  • Iterative trial-and-error technique
  • Built-In-Self-Test (BIST)
    • For each partition keep automatically generated test vectors in ROM
    • Apply test vectors to each partition through scan chains to locate the defective partition

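[Flowchart: iterative trial-and-error diagnosis and repair]
  • Recover to the last correct state of the switch
  • For partition i, swap in the spare for the current copy and restart execution
  • Error detected?
    • No: continue execution
    • Yes: increase i, then check whether i < # partitions
      • Yes: repeat from the recovery step
      • No: fatal defect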
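The iterative trial-and-error flow can be written as a short loop over partitions. The `ToySwitch` class is a hypothetical stand-in for the hardware: it knows which partition is defective only so that the replay can report whether the error persists. Whether a spare that did not help is swapped back out is not stated on the slide, and this sketch leaves it in place.

```python
class ToySwitch:
    """Stand-in for the switch: one defective partition, or None if unrepairable."""

    def __init__(self, num_partitions, defective):
        self.num_partitions = num_partitions
        self.defective = defective
        self.using_spare = [False] * num_partitions
        self.recoveries = 0

    def recover(self):
        """Roll back to the last correct state (only counted here)."""
        self.recoveries += 1

    def swap_in_spare(self, i):
        self.using_spare[i] = True

    def replay_has_error(self):
        """Re-run execution; the error persists until the bad copy is replaced."""
        if self.defective is None:        # e.g. defect outside any spared partition
            return True
        return not self.using_spare[self.defective]


def iterative_repair(switch):
    """Return the index of the repaired partition, or None for a fatal defect."""
    i = 0
    while i < switch.num_partitions:
        switch.recover()
        switch.swap_in_spare(i)
        if not switch.replay_has_error():
            return i                      # error gone: continue execution
        i += 1
    return None                           # all partitions tried: fatal defect


if __name__ == "__main__":
    sw = ToySwitch(num_partitions=4, defective=2)
    print(iterative_repair(sw), sw.recoveries)
```

The loop needs no test vectors or scan chains, at the price of up to one replay per partition. That fits the earlier point that diagnosis and repair may be slow because they run rarely.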

Exploring Defect-Tolerant CMP Switch Designs

How do these techniques affect the system's lifetime?

Design points shown (partitioning; redundancy; diagnosis; area; SPF):
  • 12 partitions (cmps); 2/5 spare input controllers, 1 spare per cmp. (rest); iterative replay; Area = 1.76X; SPF = 2.53
  • 206 partitions; 1 spare per partition; Built-In-Self-Test; Area = 3.16X; SPF = 5.54
  • 12 partitions (cmps); TMR; Area = 3.04X; SPF = 1.54
  • 206 partitions; 2 spares per partition; iterative replay; Area = 3.4X; SPF = 11.1
  • 206 partitions; 1 spare per partition; iterative replay; Area = 2.3X; SPF = 7.6

[Figure: design-space plot distinguishing Pareto-optimal from Pareto sub-optimal designs, with directions labeled "cheaper designs" and "more robust designs"]
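To make the Pareto distinction concrete, the sketch below filters the five design points listed on this slide, treating lower area as cheaper and higher SPF as more robust. It considers only these five points, so the result need not match the full plot in the original figure. The short design names are labels invented for this example.

```python
# (name, area relative to baseline, SPF) for the five design points on the slide.
DESIGNS = [
    ("12 cmps, mixed spares, replay", 1.76, 2.53),
    ("206 parts, 1 spare, BIST", 3.16, 5.54),
    ("12 cmps, TMR", 3.04, 1.54),
    ("206 parts, 2 spares, replay", 3.4, 11.1),
    ("206 parts, 1 spare, replay", 2.3, 7.6),
]


def pareto_optimal(points):
    """Keep designs that no other design beats on both area and SPF."""
    optimal = []
    for name, area, spf in points:
        dominated = any(
            (a <= area and s >= spf) and (a < area or s > spf)
            for _, a, s in points
        )
        if not dominated:
            optimal.append(name)
    return optimal


if __name__ == "__main__":
    for name in pareto_optimal(DESIGNS):
        print(name)
```

Among these five points, the TMR design and the BIST design are each dominated by a cheaper design with a higher SPF.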

“Bathtub Curve”: A model for semiconductor hard failures
  • The lifetime failure rate for semiconductor systems follows what is known as the bathtub curve
  • Trend for future process technologies:
    • The failure rate during the grace period increases
    • The breakdown period begins earlier in the system's lifetime

[Figure: bathtub curve of failure rate (FIT) over time, showing the infant period, grace period, and breakdown period, with a second curve labeled "Future process technologies"]

System Lifetime – A Post 65nm Technology Case Scenario

[Figure: failure rate (FIT), axis from 12,000 to 120,000, for four designs: TMR (SPF = 1.54); 3/5 spare IC, 1 spare rest (SPF = 3.01); 1 spare (SPF = 7.63); 2 spares (SPF = 11.11). Annotation: 1 defect every two years]
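The "1 defect every two years" annotation can be related to the FIT axis with a short conversion, since one FIT is one failure per 10^9 device-hours. The sketch assumes 8,760 hours per year (365 days); the function names are made up for this example.

```python
HOURS_PER_YEAR = 365 * 24        # assumption: non-leap year, continuous operation
FIT_HOURS = 1e9                  # 1 FIT = 1 failure per 10^9 device-hours


def fit_from_years_between_defects(years: float) -> float:
    """Failure rate in FIT for one defect every `years` years."""
    return FIT_HOURS / (years * HOURS_PER_YEAR)


def years_between_defects_from_fit(fit: float) -> float:
    """Inverse conversion: mean years between defects at a given FIT rate."""
    return FIT_HOURS / (fit * HOURS_PER_YEAR)


if __name__ == "__main__":
    print(round(fit_from_years_between_defects(2.0)))   # about 57,000 FIT
    print(years_between_defects_from_fit(12000))        # years per defect at 12,000 FIT
```

Under this assumption, one defect every two years corresponds to roughly 57,000 FIT, which sits within the 12,000 to 120,000 range of the axis.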

Conclusions – Future Work

Conclusions

  • Traditional mechanisms are insufficient for tolerating moderate numbers of defects
  • Domain-specific techniques along with resource sparing, iterative diagnosis and reconfiguration are more effective
  • Decomposing the design into modest-sized partitions gives the most effective granularity at which to apply redundancy

Future Work

  • Use of spare components based on component wear-out profiles
  • Explore low-cost defect-tolerant techniques for microprocessors