Switch architectures

Switch Architectures

Input Queued, Output Queued,

Combined Input and Output Queued



Outline

  • I. Introduction

  • II. System Model

  • III. The Least Cushion First/Most Urgent First Algorithm

  • IV. Conclusion



Ⅰ. Introduction

  • Exponential growth of Internet traffic demands large-scale switches

  • Common switch architectures:

    • Output Queued

      • High performance

      • Easier to provide QoS guarantees

      • Serious memory-bandwidth scaling problem

    • Input Queued

      • More scalable

      • Suffers from head-of-line (HOL) blocking

      • Virtual output queues (VOQs) can improve performance

      • Difficult to provide QoS guarantees



Output Queued-Shared Bus

(Figure: a 4×4 output-queued switch; input ports 1–4 deliver cells over a shared bus into a dedicated queue at each output port 1–4.)


Output Queued-Shared Memory

(Figure: a 4×4 output-queued switch; input ports 1–4 write into a single shared memory that feeds output ports 1–4.)

Input Queued

(Figure: a 4×4 input-queued crossbar; a single FIFO at each input port 1–4 feeds output ports 1–4.)


Input Queued with VOQ

(Figure: a 4×4 input-queued switch with virtual output queues; each input port 1–4 maintains a separate queue for each output port 1–4.)
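As a minimal sketch (our own illustration, not from the slides), an input port with virtual output queues can be modeled as one FIFO per output, so a cell waiting for a busy output never blocks cells headed elsewhere:

```python
from collections import deque

# A minimal VOQ sketch: one FIFO per output port removes HOL blocking,
# because a cell destined to a busy output never sits in front of
# cells destined to other outputs.

class VOQInputPort:
    def __init__(self, num_outputs):
        self.voq = [deque() for _ in range(num_outputs)]

    def enqueue(self, cell, output):
        self.voq[output].append(cell)

    def dequeue(self, output):
        """Serve the queue for one output, independent of the others."""
        return self.voq[output].popleft() if self.voq[output] else None
```

With a single FIFO per input, the cell for output 0 below would have blocked the cell for output 1; with VOQs each queue is served independently.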



Ⅰ. Introduction

Memory bandwidth requirements for the three common switch architectures (S: link speed; N: switch size, N×N):

(The bandwidth table from the original slide is not preserved in this transcript.)

  • Input queueing is necessary!

  • Speeding up the switch fabric improves performance, yielding a combined input and output queued (CIOQ) switch
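The slide's bandwidth table is not preserved, but the comparison can be sketched with the commonly cited textbook formulas; the values below are those standard figures, an assumption of this sketch rather than numbers taken from the presentation:

```python
# Commonly cited per-memory bandwidth requirements (standard textbook
# values, assumed here since the slide's table is lost).
# S: link speed, N: switch size (N x N).

def memory_bw(arch: str, N: int, S: float) -> float:
    """Bandwidth one memory must sustain so no cell is lost in a slot."""
    if arch == "shared_memory":
        return 2 * N * S          # N writes + N reads per slot
    if arch == "output_queued":
        return (N + 1) * S        # up to N writes + 1 read per output
    if arch == "input_queued":
        return 2 * S              # 1 write + 1 read per input queue
    raise ValueError(arch)

for arch in ("shared_memory", "output_queued", "input_queued"):
    print(arch, memory_bw(arch, N=32, S=10e9) / 1e9, "Gb/s")
```

Only the input-queued requirement is independent of N, which is why input queueing is necessary at large scale.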



Ⅰ. Introduction

Matching Algorithms for Performance Improvement

(Figure: requests between input ports and output ports form a bipartite graph; a matching connects each matched input to exactly one output and vice versa.)


Ⅰ. Introduction

(Figure: a CIOQ switch and an OQ switch fed identical input traffic; exact emulation means they produce identical departure patterns.)

Exact emulation: under identical input traffic, the departure time of every cell from the CIOQ switch is identical to its departure time from the OQ switch.



Ⅰ. Introduction

  • We propose a new scheduling algorithm called the least cushion first / most urgent first (LCF/MUF) algorithm

    • O(N) complexity with parallel comparators

    • Exactly emulates an OQ switch with a speedup of 2

    • No constraint on service discipline


Ⅱ. System Model

(Figure: an N×N CIOQ switch whose switching fabric runs with a speedup of 2.)



Ⅱ. System Model

  • The switch fabric is sped up by a factor of 2

    • There are 2 scheduling phases in slot k, referred to as phase k.1 and phase k.2

    • A cell delivered to its destined output port in phase k.1 can be transmitted out of the output port in the same slot (i.e., cut through)

    • A cell delivered in phase k.2 can only be transmitted in slot k+1 or after





Ⅲ. The Least Cushion First / Most Urgent First Algorithm

  • Let c(i,j) denote a cell at input port i destined to output port j

  • Definition 1: The cushion C(c(i,j)) of cell c(i,j):

    • The number of cells residing in output port j that will depart the emulated OQ switch earlier than cell c(i,j)

  • Definition 2: The cushion L(i,j) between input port i and output port j:

    • The minimum of C(c(i,j)) over all cells at input port i destined to output port j

    • If there is no cell at input port i destined to output port j, then L(i,j) is set to ∞



Ⅲ. The Least Cushion First / Most Urgent First Algorithm

  • Definition 3: The scheduling matrix of an N×N switch is the N×N square matrix whose (i,j)th entry equals L(i,j)

  • Definition 4: The input thread T(c(i,j)) of cell c(i,j) at input port i:

    • The set of cells at input port i that have a cushion smaller than or equal to C(c(i,j)), excluding cell c(i,j) itself

    • Let |T(c(i,j))| denote the size of T(c(i,j))



Ⅲ. The Least Cushion First / Most Urgent First Algorithm

  • LCF / MUF Algorithm

  • Step 1:

    • Select the (i,j)th entry with the smallest cushion L(i,j) (Least Cushion First). If the selected entry is ∞, then stop.

    • If entries with the least cushion reside in more than one column, select one column (i.e., one output port) arbitrarily.

    • For the selected column, say column j, determine the row i holding the most urgent cell destined to output port j among all input ports (Most Urgent First).



Ⅲ. The Least Cushion First / Most Urgent First Algorithm

  • LCF / MUF Algorithm

  • Step 2:

    • Eliminate the ith row and the jth column of the scheduling matrix (i.e., match output port j to input port i).

    • If the reduced matrix is null, then stop. Otherwise, apply Step 1 to the reduced matrix.

  • Consider, for example, the scheduling matrix given on page 13
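The two steps above can be sketched in code. This is our own illustration of the matching loop, assuming a cushion matrix L (with math.inf marking entries that hold no cell) and an urgency table are already available; all names here are ours, not from the slides:

```python
import math

# Sketch of the LCF/MUF matching loop. L[i][j] is the cushion between
# input i and output j (math.inf if input i holds no cell for output j);
# urgency[i][j] ranks input i's most urgent cell for output j
# (smaller = departs the emulated OQ switch sooner).

def lcf_muf(L, urgency):
    n = len(L)
    free_in, free_out = set(range(n)), set(range(n))
    match = {}                      # output j -> input i
    while free_in and free_out:
        # Step 1a: least cushion among the remaining entries.
        least = min(L[i][j] for i in free_in for j in free_out)
        if least == math.inf:       # no cell left to schedule: stop
            break
        # Step 1b: pick a column holding the least cushion
        # (arbitrary choice; here, the smallest index).
        j = min(jj for jj in free_out
                if any(L[i][jj] == least for i in free_in))
        # Step 1c: most urgent cell in that column (Most Urgent First).
        i = min((ii for ii in free_in if L[ii][j] == least),
                key=lambda ii: urgency[ii][j])
        # Step 2: eliminate row i and column j.
        match[j] = i
        free_in.discard(i)
        free_out.discard(j)
    return match
```

Each pass removes one row and one column, so with parallel comparators for the min operations the loop matches the O(N) complexity claimed earlier.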



Ⅳ. Conclusion

  • We propose a new scheduling algorithm - the least cushion first /most urgent first algorithm

    • Exactly emulates an OQ switch

    • No constraint on service discipline

  • Implementation issues of the LCF / MUF algorithm

    • The switch has to know the cushions of all cells and the relative departure order of cells destined to the same output port

    • This information can be difficult to obtain under a dynamic priority assignment scheme (e.g., WFQ)

    • It is feasible for static priority assignment schemes



Outline

  • Systolic Array

  • Binary Heap

  • Pipelined Heap

  • Hardware Design



The Systolic Array Priority Queue

(Figure: a systolic array of n blocks; each block holds a permanent data register and a temporary register, values are kept in non-increasing priority order, and the highest value exits at one end while new values enter at the other.)

For n = 1000:

Hardware required: 1000 comparators, 2000 registers.

Performance: constant time.



The Binary Heap Priority Queue

(Figure: a binary max-heap with 12 elements, shown as a binary tree and as the array [16, 14, 10, 4, 7, 8, 3, 2, 3, 3, 5, 7].)

For n = 1000:

Hardware required: 1 comparator, 1 register, 1 SRAM.

Performance: O(log n).



The Pipelined-Heap

  • Modified binary heap data structure

  • Constant-time operation. Similar to the Systolic Array.

  • Good hardware scalability. Similar to the Binary Heap.


P-heap Data Structure (B,T)

(Figure: a 4-level P-heap; the binary array B holds the values [16, 14, 10, 4, 7, 7, 3, 2, 1, 5, 8] and per-node capacity counts [4, 1, 3, 1, 0, 1, 2, 0, 1, 0, 0, 1, 0, 1, 1]; the token array T has one entry per level with operation, value, and position fields.)



The Enqueue (Insert) Operation

(Figure: an enqueue of value 9 enters the P-heap at the root: (a) local-enqueue(1), (b) local-enqueue(2).)



Enqueue (contd)

(Figure: the enqueue token percolates down the P-heap: (c) local-enqueue(3), (d) local-enqueue(4), (e) the final state.)


The Dequeue (Delete) Operation

(Figure: (a) the initial P-heap; (b) local-dequeue(1) removes the root value 16 and starts a dequeue token at level 1.)



Dequeue (contd)

(Figure: the dequeue token percolates down: (c) local-dequeue(2), (d) local-dequeue(3), (e) the final state with root value 14.)



Pipelined Operation

(Figure: two 6-level P-heaps shown at successive cycles; tokens for consecutive operations occupy different levels, so a new operation can start every cycle.)



Hardware Requirements

  • log N SRAMs represent the binary array B, where N is the size of the P-heap.

  • log N registers represent the token array T.

  • log N comparators are required, one for each level of the P-heap.



Binary Heap

Left(i) = 2*i

Right(i) = 2*i + 1

Parent(i) = i / 2

A[i] >= A[Left(i)]

A[i] >= A[Right(i)]

(Figure: the max-heap [16, 11, 12, 8, 11, 9] viewed as a binary tree and as an array.)



Binary Heap : Insert Operation

(Figure: inserting 14: it is first appended at position 7, giving [16, 11, 12, 8, 10, 9, 14], then swapped with its parent 12, giving [16, 11, 14, 8, 10, 9, 12].)



Binary Heap : Delete Operation

(Figure: deleting the maximum 16 from [16, 11, 14, 8, 10, 9, 12]: the last element 12 replaces the root, giving [12, 11, 14, 8, 10, 9], then 12 is swapped down with 14, giving [14, 11, 12, 8, 10, 9].)



Binary Heap Operations

  • Both insert and delete are O(log N) operations (log N is the number of levels in the tree)

  • 2*i can be implemented as left shift

  • i / 2 can be implemented as right shift
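The index arithmetic above can be exercised in a short sketch (ours, not from the slides), using shifts for 2*i and i/2 on a 1-based array:

```python
# A minimal max-heap matching the slides' 1-based index arithmetic:
# Left(i) = i << 1, Right(i) = (i << 1) + 1, Parent(i) = i >> 1.
# Index 0 of the backing list is unused.

class MaxHeap:
    def __init__(self):
        self.a = [None]                 # 1-based storage

    def insert(self, v):                # O(log n): swap up toward the root
        self.a.append(v)
        i = len(self.a) - 1
        while i > 1 and self.a[i >> 1] < self.a[i]:
            self.a[i >> 1], self.a[i] = self.a[i], self.a[i >> 1]
            i >>= 1

    def delete_max(self):               # O(log n): move last to root, swap down
        top, last = self.a[1], self.a.pop()
        if len(self.a) > 1:
            self.a[1] = last
            i, n = 1, len(self.a) - 1
            while True:
                l, r = i << 1, (i << 1) + 1
                big = i
                if l <= n and self.a[l] > self.a[big]:
                    big = l
                if r <= n and self.a[r] > self.a[big]:
                    big = r
                if big == i:
                    break
                self.a[i], self.a[big] = self.a[big], self.a[i]
                i = big
        return top
```

Running the insert/delete example from the figures (inserting 14 into [16, 11, 12, 8, 10, 9], then deleting the maximum) reproduces the arrays shown there.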


Some Scheduling Algorithms

  • Outline

    • PIM

    • RRM

    • iSLIP (a better solution)



Scheduling Algorithms

  • When we use a crossbar switch, we require a scheduling algorithm that matches inputs with outputs.

  • This is equivalent to finding a bipartite matching on a graph with N input and N output vertices.

  • The algorithm configures the fabric during each cell time and decides which inputs are connected to which outputs.


Scheduling Packets

(Figure: a crossbar switch with queued packets P(1,1)=1, P(1,2)=3 at input 1, P(3,2)=3, P(3,4)=1 at input 3, and P(4,4)=2 at input 4.)

  • For example:

P(input #, output #) = order to leave

The scheduling algorithm must decide the path and order of packets through the crossbar switch.


High Performance Systems

  • Usually, we design algorithms with the following properties:

    • High Throughput

    • Starvation Free

    • Fast

    • Simple to Implement



Parallel Iterative Matching (PIM)

  • PIM proceeds in three steps:

    • Step 1: Request

    • Step 2: Grant

    • Step 3: Accept

  • Each decision is made randomly.


The Mathematical Model of the Algorithm

  • We can assume that every input In[i] maintains the following state information:

    • Table Ri[0] … Ri[N-1], where Ri[k] = 1 if In[i] has a request for Out[k] (0 otherwise)

    • Table Gdi[0] … Gdi[N-1], where Gdi[k] = 1 if In[i] receives a grant from Out[k] (0 otherwise)

    • Variable Ai, where Ai = k if In[i] accepts the grant from Out[k] (-1 if no output is accepted)


The Mathematical Model (cont'd)

  • Every output Out[k] maintains the following state information:

    • Table Rdk[0] … Rdk[N-1], where Rdk[i] = 1, if Out[k] receives a request from In[i] (0, otherwise)

    • Variable Gk, where Gk = i, if Out[k] sends a grant to In[i] (-1, if no input is granted)

    • Variable Adk, where Adk = 1, if the grant from Out[k] is accepted. (0, otherwise).


The Model of PIM

  • Therefore, we can represent the PIM algorithm in terms of this state information.
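Since the slide's formal model is not preserved in this transcript, here is a hedged sketch of one PIM iteration over the request state described above; the random tie-breaking follows PIM's definition, and all names are our own:

```python
import random

# Sketch of one PIM iteration. R[i][k] = 1 if unmatched In[i] has a
# request for Out[k]; matched_in / matched_out are the ports already
# matched in earlier iterations.

def pim_iteration(R, matched_in, matched_out):
    N = len(R)
    # Grant phase: each unmatched output randomly grants one request.
    G = {}                                   # G[k] = granted input i
    for k in range(N):
        if k in matched_out:
            continue
        requesters = [i for i in range(N)
                      if i not in matched_in and R[i][k]]
        if requesters:
            G[k] = random.choice(requesters)
    # Accept phase: each input randomly accepts one of its grants.
    A = {}                                   # A[i] = accepted output k
    for i in set(G.values()):
        grants = [k for k, gi in G.items() if gi == i]
        A[i] = random.choice(grants)
    return A                                 # new input -> output matches
```

Because grants and accepts are chosen independently at random, some grants collide and go unaccepted, which is why a single iteration reaches only about 63% throughput, as noted below.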


An Example of the PIM Algorithm

(Figure: PIM running on the packets P(1,1)=1, P(1,2)=3, P(3,2)=3, P(3,4)=1, P(4,4)=2: (a) Request, (b) Grant, (c) Accept, followed by a second iteration.)



Problems with PIM

  • Randomness is hard to implement in hardware

  • Unfairness occurs among connections under oversubscription

  • Throughput is limited to approximately 63% for a single iteration


The Unfairness Problem

(Figure: with offered loads λ1,1 = λ1,2 = λ2,1 = 1, PIM serves the connections unevenly: μ1,1 = 1/4 while μ1,2 = μ2,1 = 3/4.)



Round-Robin Matching Algorithm (RRM)

  • Use rotating priority to match inputs and outputs

  • Need a pointer gi to identify the highest priority element

  • Apply rotating priority on both inputs and outputs


The Model of RRM

(The model from the original slide is not preserved in this transcript.)

RRM Scheduling

(Figure: RRM on the earlier example with packets P(1,1)=1, P(1,2)=3, P(3,2)=3, P(3,4)=1, P(4,4)=2; the accept pointer a1 and grant pointers g2 and g4 rotate through (a) Request, (b) Grant, (c) Accept.)


Synchronization Problem

(Figure: with λ1,1 = λ1,2 = λ2,1 = λ2,2 = 1, RRM achieves only μ1,1 = μ1,2 = μ2,1 = μ2,2 = 1/4.)

  • When an output receives a request, it chooses an input to grant and gi must move to a new value

  • In this example the grant pointers move in lockstep, so efficiency = 50%



iSLIP algorithm

  • Used to fix the synchronization problem of RRM

  • Changes its pointer gi only when the grant is accepted by the input; otherwise gi keeps its value

  • Solves the synchronization problem and achieves 100% throughput
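One iSLIP iteration can be sketched as follows (our own illustration, since the slides' model diagram is not preserved; g and a are the per-output grant and per-input accept pointers, and both advance only when a grant is accepted):

```python
# Sketch of one iSLIP iteration with rotating-priority pointers.
# R[i][k] = 1 if In[i] requests Out[k]; g[k] is output k's grant
# pointer, a[i] is input i's accept pointer.

def islip_iteration(R, g, a):
    """Returns {input: output} matches; mutates g and a on accept only."""
    N = len(R)
    # Grant: each output grants the requesting input at or after g[k].
    G = {}
    for k in range(N):
        for d in range(N):
            i = (g[k] + d) % N
            if R[i][k]:
                G[k] = i
                break
    # Accept: each input accepts the granting output at or after a[i].
    A = {}
    for i in set(G.values()):
        grants = {k for k, gi in G.items() if gi == i}
        for d in range(N):
            k = (a[i] + d) % N
            if k in grants:
                A[i] = k
                # Key iSLIP rule: pointers advance one past the match,
                # and only when the grant is actually accepted.
                g[k] = (i + 1) % N
                a[i] = (k + 1) % N
                break
    return A
```

On a fully loaded 2×2 example the first iteration matches only one pair, but the accepted-only pointer updates desynchronize g, and from the second iteration on every slot carries a full match.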


The Model of iSLIP

(The model from the original slide is not preserved in this transcript.)



Example of iSLIP

(Figure: iSLIP with λ1,1 = λ1,2 = λ2,1 = λ2,2 = 1; the pointers desynchronize over the 1st, 2nd, and 3rd matches.)

100% throughput is achieved.


Comparison of Three Algorithms

(The comparison table from the original slide is not preserved in this transcript.)

