scaling internet routers using optics uw october 16 th 2003 l.
Download
Skip this Video
Loading SlideShow in 5 Seconds..
Scaling Internet Routers Using Optics UW, October 16 th , 2003 PowerPoint Presentation
Download Presentation
Scaling Internet Routers Using Optics UW, October 16 th , 2003

Loading in 2 Seconds...

play fullscreen
1 / 50

Scaling Internet Routers Using Optics UW, October 16 th , 2003 - PowerPoint PPT Presentation


  • 380 Views
  • Uploaded on

Scaling Internet Routers Using Optics UW, October 16 th , 2003 Nick McKeown Joint work with research groups of: David Miller, Mark Horowitz, Olav Solgaard. Students: Isaac Keslassy, Shang-Tse Chuang, Kyoungsik Yu. Department of Electrical Engineering, Stanford University

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about 'Scaling Internet Routers Using Optics UW, October 16 th , 2003' - benjamin


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
scaling internet routers using optics uw october 16 th 2003

Scaling Internet Routers Using OpticsUW, October 16th, 2003

Nick McKeown

Joint work with research groups of: David Miller, Mark Horowitz, Olav Solgaard.

Students: Isaac Keslassy, Shang-Tse Chuang, Kyoungsik Yu.

Department of Electrical Engineering, Stanford University

Paper: http://klamath.stanford.edu/~nickm/papers/sigcomm2003.pdf

Web site: http://klamath.stanford.edu/or

backbone router capacity
Backbone router capacity

1Tb/s

100Gb/s

10Gb/s

Router capacity per rack

2x every 18 months

1Gb/s

backbone router capacity3
Backbone router capacity

1Tb/s

100Gb/s

Traffic

2x every year

10Gb/s

Router capacity per rack

2x every 18 months

1Gb/s

extrapolating
Extrapolating

100Tb/s

2015:

16x disparity

Traffic

2x every year

Router capacity

2x every 18 months

1Tb/s

consequence
Consequence
  • Unless something changes, operators will need:
    • 16 times as many routers, consuming
    • 16 times as much space,
    • 256 times the power,
    • Costing 100 times as much.
  • Actually need more than that…
stanford 100tb s internet router

Optical

Switch

Electronic

Linecard #1

Electronic

Linecard #625

160-320Gb/s

160-320Gb/s

40Gb/s

  • Line termination
  • IP packet processing
  • Packet buffering
  • Line termination
  • IP packet processing
  • Packet buffering

40Gb/s

160Gb/s

40Gb/s

100Tb/s = 640 * 160Gb/s

40Gb/s

Stanford 100Tb/s Internet Router

Goal: Study scalability

  • Challenging, but not impossible
  • Two orders of magnitude faster than deployed routers
  • We will build components to show feasibility
throughput guarantees
Throughput Guarantees
  • Operators increasingly demand throughput guarantees:
    • To maximize use of expensive long-haul links
    • For predictability and planning
  • Despite lots of effort and theory, no commercial router today has a throughput guarantee.
requirements of our router
Requirements of our router
  • 100Tb/s capacity
  • 100% throughput for all traffic
  • Must work with any set of linecards present
  • Use technology available within 3 years
  • Conform to RFC 1812
what limits router capacity
What limits router capacity?

Approximate power consumption per rack

Power density is the limiting factor today

slide11

Juniper TX8/T640

Alcatel 7670 RSP

TX8

Avici TSR

Chiaro

limits to scaling
Limits to scaling
  • Overall power is dominated by linecards
    • Sheer number
    • Optical WAN components
    • Per packet processing and buffering.
  • But power density is dominated by switch fabric
trend multi rack routers reduces power density13

Limit today ~2.5Tb/s

    • Electronics
    • Scheduler scales <2x every 18 months
    • Opto-electronic conversion

Switch

Linecards

Trend: Multi-rack routersReduces power density
multi rack routers
Multi-rack routers

Switch fabric

Linecard

In

WAN

Out

In

WAN

Out

question
Question
  • Instead, can we use an optical fabric at 100Tb/s with 100% throughput?
  • Conventional answer: No.
    • Need to reconfigure switch too often
    • 100% throughput requires complex electronic scheduler.
outline
Outline
  • How to guarantee 100% throughput?
  • How to eliminate the scheduler?
  • How to use an optical switch fabric?
  • How to make it scalable and practical?
100 throughput

R

R

?

R

R

?

Out

?

R

R

?

R

R

R

R

?

R

R

R

?

R

Out

?

R

R

R

R

?

?

R

Out

Switch capacity = N2R

Router capacity = NR

100% Throughput

In

In

In

if traffic is uniform

R

R/N

R/N

Out

R/N

R/N

R

R

R

R/N

R/N

Out

R/N

R

R/N

R/N

Out

If traffic is uniform

R

In

R

In

R

In

real traffic is not uniform

R

R

R

R

?

R/N

In

R

R/N

Out

R/N

R/N

R

R

R

R

R

In

R

R

R/N

R/N

Out

R/N

R

R

R

R/N

In

R/N

Out

Real traffic is not uniform
two stage load balancing switch

Out

Out

Out

Out

Out

100% throughput for weakly mixing, stochastic traffic.

[C.-S. Chang, Valiant]

Two-stage load-balancing switch

R

R

R

R/N

R/N

In

Out

R/N

R/N

R/N

R/N

R/N

R/N

R

R

R

In

R/N

R/N

R/N

R/N

R/N

R/N

R

R

R

R/N

R/N

In

R/N

R/N

Load-balancing stage

Switching stage

slide21

Out

Out

Out

R

R

In

3

3

3

R/N

R/N

1

R/N

R/N

R/N

R/N

R/N

R/N

R

R

In

2

R/N

R/N

R/N

R/N

R/N

R/N

R/N

R

R

R/N

In

3

R/N

R/N

slide22

Out

Out

Out

R

R

In

R/N

R/N

1

R/N

R/N

3

R/N

R/N

R/N

R/N

R

R

In

2

R/N

R/N

3

R/N

R/N

R/N

R/N

R/N

R

R

R/N

In

3

R/N

R/N

3

chang s load balanced switch good properties
Chang’s load-balanced switchGood properties
  • 100% throughput for broad class of traffic
  • No scheduler needed a Scalable
chang s load balanced switch bad properties

FOFF: Load-balancing algorithm

    • Packet sequence maintained
    • No pathological patterns
    • 100% throughput - always
    • Delay within bound of ideal
    • (See paper for details)
Chang’s load-balanced switchBad properties
  • Packet mis-sequencing
  • Pathological traffic patterns a Throughput 1/N-th of capacity
  • Uses two switch fabricsa Hard to package
  • Doesn’t work with some linecards missinga Impractical
single mesh switch

One linecard

R

R

Out

R

R

Out

R

R

Out

Single Mesh Switch

2R/N

In

2R/N

2R/N

2R/N

In

2R/N

2R/N

2R/N

2R/N

In

2R/N

packaging

2R/N

2R/N

Backplane

Out

R

2R/N

2R/N

2R/N

2R/N

Out

R

2R/N

2R/N

R/N

Out

R

Packaging

R

In

R

In

R

In

many fabric options

C1, C2, …, CN

C1

C2

C3

CN

In

In

In

In

Out

Out

Out

Out

Many fabric options

N channels each at rate 2R/N

Any permutation

network

Options

Space: Full uniform mesh

Time: Round-robin crossbar

Wavelength: Static WDM

static wdm switching

A, A, A, A

A, B, C, D

B, B, B, B

A, B, C, D

C, C, C, C

A, B, C, D

D, D, D, D

A, B, C, D

4 WDM channels,

each at rate 2R/N

In

In

In

In

Out

Out

Out

Out

Static WDM switching

Array

Waveguide

Router

(AWGR)

Passive andAlmost ZeroPower

A

B

C

D

linecard dataflow

2

2

2

2

2

2

l1

R

l1, l2,.., lN

WDM

lN

R

l1

l1, l2,.., lN

R

R

WDM

2

lN

Out

l1

R

l1, l2,.., lN

R

1

1

1

1

WDM

lN

Linecard dataflow

In

l1

l1, l2,.., lN

R

R

WDM

lN

1

3

1

1

1

1

2

3

4

1

1

1

1

problems of scale
Problems of scale
  • For N < 64, WDM is a good solution.
  • We want N = 640.
  • Need to decompose.
decomposing the mesh
Decomposing the mesh

2R/8

1

1

2

2

3

3

4

4

5

5

6

6

7

7

8

8

decomposing the mesh32

WDM

TDM

Decomposing the mesh

1

2R/8

2R/8

1

2R/4

2R/8

2R/8

2

2

3

3

4

4

5

5

6

6

7

7

8

8

when n is too large decompose into groups or racks

1

L

1

2

2

L

When N is too largeDecompose into groups (or racks)

Group/Rack 1

2R

Array

Waveguide

Router

(AWGR)

l1, l2, …, lG

2R

1

2R

Group/Rack G

2R

l1, l2, …, lG

2R

G

2R

when a linecard is missing
When a linecard is missing
  • Each linecard spreads its data equally over every other linecard.
  • Problem: If one is missing, or failed, then the spreading no longer works.
when a linecard fails

R

R

2R/3 + 2R/3 = (4/3)R

2R/3 + 2R/6 + 2R/3 + 2R/6 = 2R

2R/3 + 2R/6

Out

2R/3 + 2R/6

R

R

Out

R

R

2R/3 + 2R/6

Out

2R/3 + 2R/6

When a linecard fails

2R/3

In

2R/3

2R/3

  • Solution:
  • Move light beams
    • Replace AWGR with MEMS switch.
    • Reconfigure when linecard added, removed or fails.
  • Finer channel granularity
    • Multiple paths.

2R/3

In

2R/3

2R/3

2R/3

2R/3

In

2R/3

solution use transparent mems switches

1

MEMS

Switch

G

1

MEMS

Switch

G

1

MEMS

Switch

G

L

1

2

1

2

L

SolutionUse transparent MEMS switches

Group/Rack 1

MEMS switches reconfigured only when linecard added, removed or fails.

2R

2R

2R

Group/RackG=40

2R

2R

2R

Theorems:

1. Require L+G-1 MEMS switches

2. Polynomial time reconfiguration algorithm

slide37

Middle-Stage

First-Stage

Final-Stage

GxG

Middle

Switch

1

1

Linecard 1

Linecard 1

2

2

LxM

Local

Switch

MxL

Local

Switch

Linecard 2

Linecard 2

3

3

1

Linecard L

M

M

Linecard L

GxG

Middle

Switch

Group 1

Group 1

1

1

2

Linecard 1

Linecard 1

2

2

LxM

Local

Switch

MxL

Local

Switch

Linecard 2

Linecard 2

3

3

GxG

Middle

Switch

Linecard L

M

M

Linecard L

3

Group 2

Group 2

1

1

Linecard 1

Linecard 1

2

2

LxM

Local

Switch

MxL

Local

Switch

Linecard 2

Linecard 2

3

3

GxG

Middle

Switch

Linecard L

M

M

Linecard L

M

Group G

Group G

Hybrid Architecture: Logical View

hybrid electro optical architecture

Static

MEMS

Electronic

Switches

Fixed

Lasers

Optical

Receivers

Electronic

Switches

GxG

MEMS

1

1

Linecard 1

Linecard 1

LxM

Crossbar

MxL

Crossbar

2

2

Linecard 2

Linecard 2

3

3

1

Linecard L

Linecard L

M

M

GxG

MEMS

Group 1

Group 1

1

1

Linecard 1

Linecard 1

2

LxM

Crossbar

MxL

Crossbar

2

2

Linecard 2

Linecard 2

3

3

GxG

MEMS

Linecard L

Linecard L

M

M

3

Group 2

Group 2

1

1

Linecard 1

Linecard 1

LxM

Crossbar

MxL

Crossbar

2

2

Linecard 2

Linecard 2

3

3

GxG

MEMS

Linecard L

Linecard L

M

M

M

Group G

Group G

Hybrid Electro-Optical Architecture
number of mems switches
Number of MEMS Switches

R

R

R

Linecard 1

Crossbar

Crossbar

Linecard 1

R

R

Linecard 2

Linecard 2

R

R

Linecard 3

Crossbar

Crossbar

Linecard 3

R

R

R

R

R

Linecard 4

Linecard 4

StaticMEMS

R

R

R

Linecard 1

Crossbar

Crossbar

Linecard 1

R

R

R

Linecard 2

Linecard 2

R

R

Linecard 3

Crossbar

Crossbar

Linecard 3

R

R

R

R

Linecard 3

Linecard 4

number of mems switches40
Number of MEMS Switches

R

R

4R/3

Linecard 1

Crossbar

Crossbar

Linecard 1

R

R

Linecard 2

Linecard 2

R

R

Linecard 3

Crossbar

Crossbar

Linecard 3

2R/3

2R/3

R/3

StaticMEMS

R

R

R

Linecard 1

Crossbar

Crossbar

Linecard 1

R/3

R

2R/3

R

Linecard 2

Linecard 2

R/3

R

R

Linecard 3

Crossbar

Crossbar

Linecard 3

2R/3

number of mems needed for a schedule
Number of MEMS needed for a schedule
  • Li: number of linecards in group i, 1 ≤ i ≤ G. Group i needs to send to group j:
  • Assume each group can send at most R to each MEMS. Number of MEMS needed between groups i and j:
number of mems needed for a schedule42
Number of MEMS needed for a schedule
  • The number of MEMS needed for group i to send to group j is Aij.
  • The total number of MEMS needed for group i is the sum of the Aij’s
constraints for the tdm schedule
Constraints for the TDM Schedule
  • Latin Square: In any period N, each transmitting linecard is connected to each receiving linecard exactly once.
  • MEMS constraint: In any time-slot, there are at most Aijconnections between transmitting group i and receiving group j, where:
example
Example
  • Assume L1=3, L2=2, L3=1
  • Then
  • E.g., at most 2 packets from the first group to the first group at each time-slot
configuration algorithm
Configuration Algorithm
  • Assign connections between groups, so MEMS constraint is satisfied.
  • Assign group connections to specific linecards, so there is exactly one connection per linecard pair in the schedule.

Comments:

  • Algorithm is surprisingly complex.
  • Best running time so far: 40 seconds for 640 linecards.
challenges

Low-cost, low-power optoelectronic conversion?

l1

Pkt

Switch

How to build a 250ms

160Gb/s buffer?

WDM

lG

l1

R

R

WDM

lG

Challenges

In

l1

Address

Lookup

l1, l2,.., lG

R

R

WDM

lG

R

l1, l2,.., lG

l1, l2,.., lG

1

1

1

2

2

R=160Gb/s

3

4

Out

l1

R

l1, l2,.., lG

R

WDM

lG

what we are building

Chip #2: 16 x 55

Opto-electronic crossbar

55 x 10Gb/s

55 x 10Gb/s

1500nm Optical source

16 x 10Gb/s

CMOS ASIC

To Linecards

To Optical Fabric

What we are building

250ms DRAM

320Gb/s

Chip #1: 160Gb/s Packet Buffer

Buffer Manager

90nm ASIC

160Gb/s

160Gb/s

Optical Detector

Optical Modulator

100tb s load balanced router

40 x 40

MEMS

Linecard Rack 1

Linecard Rack G = 40

Switch Rack < 100W

L = 16

160Gb/s

linecards

L = 16

160Gb/s

linecards

1

2

55

56

100Tb/s Load-Balanced Router

L = 16

160Gb/s

linecards