By **livvy**

## Switch Architectures


Outline

- I. Introduction
- II. System Model
- III. The Least Cushion First/Most Urgent First Algorithm
- IV. Conclusion

Ⅰ. Introduction

- Exponential growth of Internet traffic demands large-scale switches
- Common switch architectures
  - Output queued (OQ)
    - High performance
    - Easier to provide QoS guarantees
    - Has a serious scaling problem
  - Input queued (IQ)
    - More scalable
    - Suffers from head-of-line (HOL) blocking
    - Virtual output queues can improve performance
    - Difficult to provide QoS guarantees

Ⅰ. Introduction

Memory BW requirements for three common switch architectures:

S: link speed

N: switch size (N×N)

- Input queueing is necessary!
- The switch fabric can be sped up to improve performance, which gives the combined input and output queued (CIOQ) switch
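The comparison table from this slide is not preserved in the transcript. As a rough illustration only, the sketch below uses the commonly cited per-memory bandwidth figures: an output-queued memory must absorb up to N arrivals and send one departure per cell time, an input-queued memory handles one write and one read, and a CIOQ memory with fabric speedup s handles s + 1 accesses. The function name and the architecture labels are assumptions, not taken from the slides.

```python
def memory_bandwidth(architecture, n, s, speedup=2):
    """Approximate bandwidth required of one queue memory.

    n       -- switch size (N x N)
    s       -- link speed S, in any consistent unit
    speedup -- fabric speedup, used only for the CIOQ case (assumed 2)
    """
    if architecture == "OQ":
        # Up to N cells may arrive for one output while one cell departs.
        return (n + 1) * s
    if architecture == "IQ":
        # One arrival and one departure per cell time.
        return 2 * s
    if architecture == "CIOQ":
        # One access at line rate plus `speedup` accesses on the fabric side.
        return (speedup + 1) * s
    raise ValueError("unknown architecture: " + architecture)


for arch in ("OQ", "IQ", "CIOQ"):
    print(arch, memory_bandwidth(arch, n=32, s=10))
```

With these assumptions only the OQ requirement grows with N, which is the scaling problem noted on the previous slide.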

[Figure: a CIOQ switch and the emulated OQ switch, each with inputs 1 to N and outputs 1 to N. Both receive identical input traffic and produce an identical departure pattern.]

Ⅰ. Introduction

Exact emulation: under identical input traffic, the departure time of every cell from the CIOQ switch is identical to its departure time from the OQ switch.

Ⅰ. Introduction

- We propose a new scheduling algorithm called the least cushion first / most urgent first (LCF/MUF) algorithm
  - O(N) complexity with parallel comparators
  - Exactly emulates an OQ switch with a speedup of 2
  - No constraint on service discipline

Ⅱ. System Model

- The switch fabric is sped up by a factor of 2
  - There are 2 scheduling phases in slot k, referred to as phase k.1 and phase k.2
  - A cell delivered to its destination output port in phase k.1 can be transmitted out of the output port in the same slot (i.e., cut-through)
  - A cell delivered in phase k.2 can only be transmitted in slot k+1 or later

Ⅲ. The Least Cushion First / Most Urgent First Algorithm

- Let $x_{ij}$ denote a cell at input port i destined to output port j
- Definition 1: The cushion $C(x_{ij})$ of cell $x_{ij}$:
  - The number of cells residing in output port j which will depart the emulated OQ switch earlier than cell $x_{ij}$
- Definition 2: The cushion $C(i,j)$ between input port i and output port j:
  - The minimum of $C(x_{ij})$ over all cells $x_{ij}$ at input port i destined to output port j
  - If there is no cell destined to output port j, then $C(i,j)$ is set to $\infty$

Ⅲ. The Least Cushion First / Most Urgent First Algorithm

- Definition 3: The scheduling matrix of an N×N switch is an N×N square matrix whose (i,j)th entry equals the cushion $C(i,j)$ between input port i and output port j
- Definition 4: The input thread $IT(x_{ij})$ of a cell $x_{ij}$ at input port i:
  - The set of cells at input port i whose cushion is smaller than or equal to $C(x_{ij})$, excluding cell $x_{ij}$ itself
  - Let $|IT(x_{ij})|$ denote the size of $IT(x_{ij})$
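Definitions 1 to 4 can be made concrete with a small sketch. It assumes each cell is a hypothetical `(destination, departure_time)` pair, where the departure time is the slot in which the emulated OQ switch would send the cell, and that `output_buffers[j]` lists the departure times of the cells already residing at output port j. These representations are illustrative choices, not part of the slides.

```python
import math


def cell_cushion(cell, output_buffers):
    """Definition 1: cells at the destination output that depart earlier."""
    dest, depart = cell
    return sum(1 for t in output_buffers[dest] if t < depart)


def scheduling_matrix(input_cells, output_buffers):
    """Definitions 2 and 3: entry (i, j) is the least cushion, or infinity."""
    n = len(input_cells)
    matrix = [[math.inf] * n for _ in range(n)]
    for i, cells in enumerate(input_cells):
        for cell in cells:
            j = cell[0]
            matrix[i][j] = min(matrix[i][j], cell_cushion(cell, output_buffers))
    return matrix


def input_thread(cell, cells_at_input, output_buffers):
    """Definition 4: other cells at the same input with cushion <= this one."""
    limit = cell_cushion(cell, output_buffers)
    return [c for c in cells_at_input
            if c is not cell and cell_cushion(c, output_buffers) <= limit]


# Hypothetical 2x2 example.
output_buffers = [[3, 5], [4]]
input_cells = [[(0, 4), (1, 6)], [(0, 7)]]
print(scheduling_matrix(input_cells, output_buffers))
print(input_thread(input_cells[0][0], input_cells[0], output_buffers))
```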

Ⅲ. The Least Cushion First / Most Urgent First Algorithm

- LCF / MUF Algorithm
  - Step 1:
    - Select the (i,j)th entry of the scheduling matrix with the smallest cushion, $C(i,j) = \min_{k,l} C(k,l)$ (Least Cushion First). If the selected entry is $\infty$, then stop.
    - If more than one entry has the least cushion and they reside in different columns, then select a column (i.e., an output port) arbitrarily.
    - For the selected column, say column j, determine the row i which has the most urgent cell destined to output port j among all cells at all input ports (Most Urgent First).

Ⅲ. The Least Cushion First / Most Urgent First Algorithm

- LCF / MUF Algorithm
  - Step 2:
    - Eliminate the ith row and the jth column of the scheduling matrix (i.e., match output port j to input port i).
    - If the reduced matrix becomes null, then stop. Otherwise, go to Step 1 with the reduced matrix.
- Consider, for example, the scheduling matrix given on page 13
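Steps 1 and 2 can be sketched as a loop over a shrinking matrix. This is a minimal sequential sketch, not the parallel-comparator design the slides refer to. It assumes a hypothetical `departure[i][j]` table that gives the emulated-OQ departure time of the most urgent cell at input i for output j, and it breaks least-cushion ties by the lowest column index as one arbitrary choice. The example matrices are made up for illustration.

```python
import math


def lcf_muf(cushion, departure):
    """Return (input, output) matches for one scheduling phase."""
    n = len(cushion)
    rows, cols = set(range(n)), set(range(n))
    matches = []
    while rows and cols:
        # Least Cushion First: smallest remaining entry and its column.
        least, j = min((cushion[r][c], c) for r in rows for c in cols)
        if least == math.inf:
            break  # no remaining input has a cell for a remaining output
        # Most Urgent First: in column j, the row whose cell departs earliest.
        candidates = [r for r in rows if cushion[r][j] != math.inf]
        i = min(candidates, key=lambda r: departure[r][j])
        # Step 2: match input i to output j, then drop row i and column j.
        matches.append((i, j))
        rows.remove(i)
        cols.remove(j)
    return matches


INF = math.inf
cushion = [[1, INF, 2],
           [0, 3, INF],
           [INF, 1, 1]]
departure = [[5, INF, 9],
             [3, 8, INF],
             [INF, 4, 6]]
print(lcf_muf(cushion, departure))  # [(1, 0), (2, 1), (0, 2)]
```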

Ⅳ. Conclusion

- We propose a new scheduling algorithm: the least cushion first / most urgent first algorithm
  - Exactly emulates an OQ switch
  - No constraint on service discipline
- Implementation issues of the LCF / MUF algorithm
  - A switch has to know the cushions of all cells and the relative departure order of cells destined to the same output port
  - It could be difficult to obtain this information for a dynamic priority assignment scheme (e.g., WFQ)
  - Feasible for static priority assignment schemes

Outline

- Systolic Array
- Binary Heap
- Pipelined Heap
- Hardware Design

The Systolic Array Priority Queue

[Figure: a systolic array of blocks 1 to n, each with a permanent data register and a temporary register. New values enter and the highest value leaves at one end; priority values are non-increasing along the array.]

n = 1000

Hardware required: 1000 comparators, 2000 registers.

Performance: constant time.

The Binary Heap Priority Queue

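[Figure: a 12-element binary heap drawn as a tree, together with its array representation.]

| Index | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Value | 16 | 14 | 10 | 4 | 7 | 8 | 3 | 2 | 3 | 3 | 5 | 7 |

n = 1000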

Hardware required: 1 comparator, 1 register, 1 SRAM.

Performance: O(log n).

The Pipelined-Heap

- Modified binary heap data structure
- Constant-time operation. Similar to the Systolic Array.
- Good hardware scalability. Similar to the Binary Heap.

[Figure: P-heap data structure (B, T). The binary array B is drawn as a tree with levels 1 to 4 and node positions 1 to 15, each node carrying a value and a capacity. The token array T has one entry per level with the fields operation, value, and position.]

P-heap Data Structure (B, T)

The Enqueue (Insert) Operation

[Figure, panels (a) local-enqueue(1) and (b) local-enqueue(2): the token (enq, value 9) is at position 1 in panel (a) and at position 2 in panel (b).]

Enqueue (contd)

[Figure, panels (c) local-enqueue(3) and (d) local-enqueue(4): the token is (enq, value 9) at position 5 in panel (c) and (enq, value 7) at position 10 in panel (d).]

[Figure, panel (e): the P-heap after the enqueue has completed.]
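The enqueue panels can be traced with a sequential sketch. It is not pipelined, and it assumes that a node's capacity counts the free slots in the subtree rooted at that node, which is how the capacity row in the data-structure figure appears to be used. The starting tree below is read off the panels as far as they can be recovered. Placing the value 3 at position 6 is a guess.

```python
def build_capacity(value):
    """capacity[i] = number of empty nodes in the subtree rooted at i."""
    n = len(value) - 1  # index 0 is unused so that children are 2i and 2i+1
    cap = [0] * (n + 1)
    for i in range(n, 0, -1):
        kids = [c for c in (2 * i, 2 * i + 1) if c <= n]
        cap[i] = int(value[i] is None) + sum(cap[c] for c in kids)
    return cap


def enqueue(value, cap, x):
    """Insert x top-down and return the list of positions visited."""
    if cap[1] == 0:
        raise OverflowError("P-heap is full")
    pos, path = 1, []
    while True:
        path.append(pos)
        cap[pos] -= 1  # this subtree is about to gain one element
        if value[pos] is None:
            value[pos] = x
            return path
        if x > value[pos]:
            # Keep the larger value here and carry the smaller one down.
            value[pos], x = x, value[pos]
        left = 2 * pos
        pos = left if cap[left] > 0 else left + 1


value = [None, 16, 14, 10, 8, 7, 3, None, 2, 4, None, 5,
         None, None, None, None]
cap = build_capacity(value)
print(enqueue(value, cap, 9))  # [1, 2, 5, 10]
print(value[5], value[10])     # 9 7
```

Because each step touches a single level, the levels can work on different operations at the same time in hardware, which is where the constant-time behaviour comes from.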

The Dequeue (Delete) Operation

[Figure, panels (a) and (b) local-dequeue(1): panel (a) shows the P-heap before the dequeue, with 16 at the root; in panel (b) the token (deq) is at position 1.]

Dequeue (contd)

[Figure, panels (c) local-dequeue(2) and (d) local-dequeue(3): the root now holds 14; the token (deq) is at position 2 in panel (c) and at position 4 in panel (d).]

[Figure, panel (e): the P-heap after the dequeue has completed, with 14 at the root.]
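The dequeue panels can be traced the same way. This sequential sketch assumes the root value is removed first and the resulting hole is then filled level by level with the larger child. Capacity is again assumed to count free slots per subtree. The starting tree is read off panel (a) as far as it can be recovered, with the value 3 placed at position 6 as a guess.

```python
def build_capacity(value):
    """capacity[i] = number of empty nodes in the subtree rooted at i."""
    n = len(value) - 1  # index 0 is unused
    cap = [0] * (n + 1)
    for i in range(n, 0, -1):
        kids = [c for c in (2 * i, 2 * i + 1) if c <= n]
        cap[i] = int(value[i] is None) + sum(cap[c] for c in kids)
    return cap


def dequeue(value, cap):
    """Remove the root; return (removed value, positions visited)."""
    n = len(value) - 1
    top, value[1] = value[1], None
    pos, path = 1, []
    while True:
        path.append(pos)
        cap[pos] += 1  # this subtree has lost one element
        kids = [c for c in (2 * pos, 2 * pos + 1)
                if c <= n and value[c] is not None]
        if not kids:
            return top, path  # the hole has reached an empty subtree
        big = max(kids, key=lambda c: value[c])
        value[pos], value[big] = value[big], None  # pull the larger child up
        pos = big


value = [None, 16, 14, 10, 8, 7, 3, None, 2, 4, None, 5,
         None, None, None, None]
cap = build_capacity(value)
top, path = dequeue(value, cap)
print(top, path)   # 16 [1, 2, 4, 9]
print(value[1:8])  # [14, 8, 10, 4, 7, 3, None]
```

The first three positions match the token positions in panels (b) to (d). The final position, 9, is where the hole comes to rest in this sketch.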

Hardware Requirements

- log N SRAMs represent the binary array B, where N is the size of the P-heap.
- log N registers represent the token array T.
- log N comparators are required, one for each level of the P-heap.

Binary Heap

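Left(i) = 2*i

Right(i) = 2*i + 1

Parent(i) = i / 2 (integer division)

A[i] >= A[Left(i)]

A[i] >= A[Right(i)]

[Figure: a 6-element heap viewed as a binary tree and as an array.]

| Index | 1 | 2 | 3 | 4 | 5 | 6 |
| --- | --- | --- | --- | --- | --- | --- |
| Value | 16 | 11 | 12 | 8 | 11 | 9 |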

Binary Heap: Insert Operation

[Figure: inserting 14, viewed as a binary tree and as an array. The new value enters at index 7 and is exchanged with its parent at index 3.]

| Index | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Before | 16 | 11 | 12 | 8 | 10 | 9 | 14 |
| After | 16 | 11 | 14 | 8 | 10 | 9 | 12 |

Binary Heap: Delete Operation

[Figure: deleting the maximum, viewed as a binary tree and as an array. The last element, 12, replaces the root 16 and is then exchanged with its larger child, 14.]

| Index | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Start | 16 | 11 | 14 | 8 | 10 | 9 | 12 |
| Root replaced | 12 | 11 | 14 | 8 | 10 | 9 | |
| After | 14 | 11 | 12 | 8 | 10 | 9 | |

Binary Heap Operations

- Both insert and delete are O(log N) operations, since each visits at most one node per level of the tree
- 2*i can be implemented as a left shift
- i / 2 can be implemented as a right shift
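The index arithmetic above is enough for a short software sketch of both operations. It assumes a 1-indexed max-heap stored in a Python list whose slot 0 is an unused placeholder. The function names are illustrative, and deleting from an empty heap is not handled.

```python
def heap_insert(A, x):
    """Append x, then sift it up while it is larger than its parent."""
    A.append(x)
    i = len(A) - 1
    while i > 1 and A[i >> 1] < A[i]:        # i >> 1 is Parent(i)
        A[i >> 1], A[i] = A[i], A[i >> 1]
        i >>= 1


def heap_delete_max(A):
    """Remove and return the root, then sift the last element down."""
    top = A[1]
    last = A.pop()
    n = len(A) - 1                            # number of elements left
    if n >= 1:
        A[1] = last
        i = 1
        while True:
            left, right = i << 1, (i << 1) + 1   # Left(i), Right(i)
            largest = i
            if left <= n and A[left] > A[largest]:
                largest = left
            if right <= n and A[right] > A[largest]:
                largest = right
            if largest == i:
                break
            A[i], A[largest] = A[largest], A[i]
            i = largest
    return top


A = [None, 16, 11, 12, 8, 10, 9]
heap_insert(A, 14)
print(A[1:])               # [16, 11, 14, 8, 10, 9, 12]
print(heap_delete_max(A))  # 16
print(A[1:])               # [14, 11, 12, 8, 10, 9]
```

The values reproduce the insert and delete examples from the preceding slides.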

Some Scheduling Algorithms

- Outline
  - PIM
  - RRM
  - iSLIP (Better solution)

Scheduling Algorithms

- When we use a crossbar switch, we require a scheduling algorithm that matches inputs with outputs.
- This is equivalent to finding a matching in a bipartite graph with N input vertices and N output vertices.
- The algorithm configures the fabric during each cell time and decides which inputs will be connected to which outputs.

Scheduling Packets: An Example

[Figure: a crossbar switch with packets queued on the input side: P(1,1)=1, P(1,2)=3, P(3,2)=3, P(3,4)=1, P(4,4)=2.]

P(input #, output #) = order to leave

The scheduling algorithm needs to decide the path and order of packets through the crossbar switch.

High performance systems

- Usually, we design algorithms with the following properties:
  - High throughput
  - Starvation free
  - Fast
  - Simple to implement

Parallel Iterative Matching (PIM)

- PIM has three steps
  - Step 1: Request
  - Step 2: Grant
  - Step 3: Accept
- Each decision is made randomly.

A mathematical model of the algorithm

- We can assume that
  - Every input In[i] maintains the following state information:
    - Table Ri[0] … Ri[N-1], where Ri[k] = 1 if In[i] has a request for Out[k] (0 otherwise)
    - Table Gdi[0] … Gdi[N-1], where Gdi[k] = 1 if In[i] receives a grant from Out[k] (0 otherwise)
    - Variable Ai, where Ai = k if In[i] accepts the grant from Out[k] (-1 if no grant is accepted)

A mathematical model (cont'd)

- Every output Out[k] maintains the following state information:
  - Table Rdk[0] … Rdk[N-1], where Rdk[i] = 1 if Out[k] receives a request from In[i] (0 otherwise)
  - Variable Gk, where Gk = i if Out[k] sends a grant to In[i] (-1 if no input is granted)
  - Variable Adk, where Adk = 1 if the grant from Out[k] is accepted (0 otherwise)

The model of PIM

- Therefore, we can represent the PIM algorithm as

An example of the PIM algorithm

[Figure: panels (a) to (c) show the Request, Grant, and Accept steps, followed by a second iteration, for queued packets P(1,2)=3, P(3,2)=3, P(3,4)=1, P(4,4)=2.]
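One way to express PIM with the state tables defined earlier is the sketch below. It is a software illustration under stated assumptions, not the formulation from the missing slide. Ports are 0-indexed, `R[i][k]` is the request table, the lists `Rd`, `G`, and `Gd` mirror the names in the model, and the `taken` list and the `rng` parameter are additions for bookkeeping and reproducibility.

```python
import random


def pim(R, iterations, rng=None):
    """Run PIM and return A, where A[i] is the output matched to In[i] or -1."""
    if rng is None:
        rng = random.Random()
    n = len(R)
    A = [-1] * n          # A[i] = k if In[i] accepted the grant from Out[k]
    taken = [False] * n   # taken[k] becomes True once Out[k] is matched
    for _ in range(iterations):
        # Request: unmatched inputs request unmatched outputs they have cells for.
        Rd = [[i for i in range(n)
               if R[i][k] and A[i] == -1 and not taken[k]]
              for k in range(n)]
        # Grant: each output picks one requesting input at random.
        G = [rng.choice(Rd[k]) if Rd[k] else -1 for k in range(n)]
        if all(g == -1 for g in G):
            break  # no requests left, so the matching cannot grow
        # Accept: each input picks one of its grants at random.
        for i in range(n):
            Gd = [k for k in range(n) if G[k] == i]
            if Gd:
                A[i] = rng.choice(Gd)
                taken[A[i]] = True
    return A


# Requests taken from the packet example, shifted to 0-indexed ports.
R = [[1, 1, 0, 0],
     [0, 0, 0, 0],
     [0, 1, 0, 1],
     [0, 0, 0, 1]]
print(pim(R, iterations=4, rng=random.Random(0)))
```

Every iteration that still has a request adds at least one match, so N iterations are enough for the result to be a maximal matching.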

Problems with PIM

- Randomness is hard to implement in hardware
- Unfairness occurs among connections when an output is oversubscribed
- Throughput is limited to approximately 63% for a single iteration

Round-Robin Matching Algorithm (RRM)

- Uses rotating priority to match inputs and outputs
- Needs a pointer gi to identify the highest priority element
- Applies rotating priority on both inputs and outputs

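Synchronization Problem

- When an output receives a request, it chooses an input to grant, and its pointer gi must move to a new value whether or not the grant is accepted
- For example

[Figure: a 2×2 switch with arrival rates $\lambda_{2,1} = \lambda_{2,2} = 1$ and service rates $\mu_{1,1} = \mu_{1,2} = 1/4$ and $\mu_{2,1} = \mu_{2,2} = 1/4$.]

Efficiency = 50%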

iSLIP algorithm

- Used to fix the synchronization problem of RRM
- An output changes its pointer gi only when its grant is accepted by the input; otherwise the pointer gi keeps its value
- Solves the synchronization problem and achieves 100% throughput for uniform traffic
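The difference between the two pointer rules can be seen in a small simulation. The sketch assumes a single iteration per slot, a fully loaded switch in which every input always has a cell for every output, and all pointers starting at 0. The function names and the `islip` flag are illustrative.

```python
def round_robin_pick(candidates, pointer, n):
    """First candidate at or after `pointer` in circular order."""
    for offset in range(n):
        c = (pointer + offset) % n
        if c in candidates:
            return c
    return None


def simulate(n, slots, islip):
    """Fraction of output slots used by a fully backlogged n x n switch."""
    g = [0] * n  # grant pointers, one per output
    a = [0] * n  # accept pointers, one per input
    matched = 0
    everyone = set(range(n))
    for _ in range(slots):
        # Grant: every output sees a request from every input.
        grant = [round_robin_pick(everyone, g[k], n) for k in range(n)]
        if not islip:
            # RRM: the grant pointer moves even if the grant is not accepted.
            for k in range(n):
                g[k] = (grant[k] + 1) % n
        # Accept: each input takes the first grant at or after its pointer.
        for i in range(n):
            offers = {k for k in range(n) if grant[k] == i}
            if not offers:
                continue
            k = round_robin_pick(offers, a[i], n)
            a[i] = (k + 1) % n
            if islip:
                # iSLIP: only an accepted grant moves the grant pointer.
                g[k] = (i + 1) % n
            matched += 1
    return matched / (n * slots)


print(simulate(2, 1000, islip=False))  # 0.5
print(simulate(2, 1000, islip=True))   # 0.9995
```

Under RRM the two grant pointers stay synchronized, so both outputs keep granting the same input and only one cell crosses per slot. Under iSLIP the pointers separate after the first slot.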
