
Low Power Architecture for High Speed Packet Classification




  1. Low Power Architecture for High Speed Packet Classification Author: Alan Kennedy, Xiaojun Wang, Zhen Liu, Bin Liu Publisher: ANCS 2008 Presenter: Chun-Yi Li Date: 2009/05/06

  2. Outline • Introduction • Adaptive Clocking Architecture • Hardware Accelerator • Hierarchical Intelligent Cutting (HiCut) • Multidimensional Cutting (HyperCuts) • Algorithm Changes • Low Power Architecture • Performance

  3. Introduction • Optical Carrier levels describe a range of digital signals that can be carried on a SONET fiber-optic network. • The number in the Optical Carrier level is directly proportional to the data rate of the bitstream carried by the digital signal.

  4. Introduction • Implementing packet classification algorithms in software is not feasible when trying to achieve high speed packet classification. • High throughput algorithms such as RFC are unable to reach OC-768 or even OC-192 line rates when run on devices such as general purpose processors, even for relatively small rulesets.

  5. Introduction • A large percentage of idle time means a large amount of unnecessary dynamic power is consumed due to unnecessary switching of logic and memory elements. [Figure: percentage of time the classifier spends idle when classifying packets from the CENIC trace at different frequencies]

  6. Outline Introduction Adaptive Clocking Architecture Hardware Accelerator Hierarchical Intelligent Cutting (HiCut) Multidimensional Cutting (HyperCuts) Algorithm Changes Low Power Architecture Performance

  7. Adaptive Clocking Architecture • The adaptive clocking unit is designed to run a packet classification hardware accelerator at up to N different frequencies. • For our packet classifier, 32 MHz was found to be fast enough to deal with the worst-case bursts of packets at OC-768 line speed, so Fmax = 32 MHz. The frequency of state i is fi = Fmax / 2^(N−i−1), i = 0, 1, ..., N−1.
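The frequency ladder above can be sketched in a few lines of Python (the function name and MHz units are ours, not from the paper):

```python
# Frequencies f_i = Fmax / 2^(N-1-i): each state runs twice as fast
# as the one below it, topping out at Fmax in the highest state.
F_MAX_MHZ = 32  # fast enough for worst-case OC-768 bursts, per the slide

def frequency_table(n_states):
    """Return the N state frequencies in MHz, lowest state first."""
    return [F_MAX_MHZ / 2 ** (n_states - 1 - i) for i in range(n_states)]

print(frequency_table(4))  # [4.0, 8.0, 16.0, 32.0]
```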

  8. Adaptive Clocking Architecture • The buffer, of size M, is divided among the N states, with each state i having a width Wi, 0 ≤ Wi ≤ M, such that M = Σ_{i=0}^{N−1} Wi. [Figure: buffer of size M partitioned into state widths W0, W1, ..., WN−1]

  9. Adaptive Clocking Architecture • The threshold for determining when a state is exited and the next higher state entered is saved in a register in the adaptive clocking unit and can be changed at any time: Ti = Σ_{j=0}^{i} Wj, i = 0, 1, ..., N−2. [Figure: thresholds T0, T1, ... marked on the buffer at the boundaries between the state widths]
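The relation between state widths and thresholds can be sketched as follows (a minimal sketch under our reading of the slide; names are ours):

```python
def thresholds(widths):
    """T_i = W_0 + ... + W_i for i = 0..N-2; the highest state
    has no exit threshold, so the last width contributes to none."""
    result, total = [], 0
    for w in widths[:-1]:
        total += w
        result.append(total)
    return result

print(thresholds([2, 3, 5]))  # [2, 5]
```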

  10. Adaptive Clocking Architecture • The output clock frequency to the packet classification hardware accelerator starts at the frequency of the lowest used state, f0. • If the threshold T0 for this state is exceeded, the next higher used state S1 is entered and the clock frequency changes to f1. [Figure: chain of states S0 through S9]

  11. Adaptive Clocking Architecture • Only states S4, S7, S8 and S9 are used. • In this case the output clock frequency to the packet classifier will start at f1. [Figure: chain of states S0 through S9 with only S4, S7, S8 and S9 in use]
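One way to sketch the resulting frequency selection (our interpretation: the classifier runs at the frequency of the first used state whose threshold the buffer occupancy does not exceed; all names are hypothetical):

```python
def select_frequency(occupancy, thresholds, freqs):
    """Pick the clock frequency for the current buffer occupancy:
    stay in state i while occupancy <= T_i, otherwise climb to the
    next used state; the top state has no exit threshold."""
    for i, t in enumerate(thresholds):
        if occupancy <= t:
            return freqs[i]
    return freqs[-1]

# With thresholds [2, 5] and frequencies [8, 16, 32] MHz:
print(select_frequency(1, [2, 5], [8, 16, 32]))  # 8
print(select_frequency(7, [2, 5], [8, 16, 32]))  # 32
```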

  12. Outline Introduction Adaptive Clocking Architecture Hardware Accelerator Hierarchical Intelligent Cutting (HiCut) Multidimensional Cutting (HyperCuts) Algorithm Changes Low Power Architecture Performance

  13. Hardware Accelerator Hierarchical Intelligent Cuttings (HiCuts) • The algorithm constructs the decision tree by recursively cutting the hyperspace one dimension at a time into subregions. • The algorithm keeps cutting the hyperspace until no subregion contains more rules than a predetermined number called binth.

  14. Hardware Accelerator Hierarchical Intelligent Cuttings (HiCuts) [Figure: root node cut on Field2 (4 cuts, labelled 00–11), producing children with rule lists such as {R0, R1, R5, R6, R7, R10, R11}; binth = 4]

  15. Hardware Accelerator Hierarchical Intelligent Cuttings (HiCuts) [Figure: the tree after further cuts on Field4 (4 cuts) and Field3 (4 cuts); binth = 4]

  16. Hardware Accelerator Hierarchical Intelligent Cuttings (HiCuts) [Figure: the finished tree after an additional cut on Field5; every leaf now holds at most binth = 4 rules]

  17. Hardware Accelerator Hierarchical Intelligent Cuttings (HiCut)

  18. Hardware Accelerator Hierarchical Intelligent Cuttings (HiCuts) • binth: limits the amount of linear searching at the leaves. • np: the number of cuts. • spfac: a multiplier that limits the storage increase caused by executing cuts at a node: Σ (rules at each child of i) + np ≤ spfac × (number of rules at i)
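A minimal sketch of the spfac space check, in the form used by the original HiCuts heuristic (function and parameter names are ours):

```python
def cut_is_affordable(rules_per_child, spfac, rules_at_node):
    """HiCuts space check: the storage a cut creates (rules replicated
    across the children, plus one pointer per child) must not exceed
    spfac times the number of rules at the node being cut."""
    np_cuts = len(rules_per_child)
    return sum(rules_per_child) + np_cuts <= spfac * rules_at_node

print(cut_is_affordable([2, 2, 2, 2], 4, 4))  # True:  12 <= 16
print(cut_is_affordable([4, 4, 4, 4], 2, 4))  # False: 20 > 8
```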

  19. Hardware Accelerator Multidimensional Cutting (HyperCuts) • The main difference from HiCuts is that HyperCuts recursively cuts the hyperspace into subregions by performing cuts on multiple dimensions at a time.

  20. Hardware Accelerator Multidimensional Cutting (HyperCuts) [Figure: a single node cut on Field1 (2 cuts) and Field5 (2 cuts) at once, giving four children with rule lists such as {R0, R2, R5}; binth = 4]

  21. Hardware Accelerator Multidimensional Cutting (HyperCuts) • spfac: a multiplier that limits the storage increase caused by executing cuts at a node: max child nodes at i ≤ spfac × sqrt(number of rules at i)
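The HyperCuts bound can be sketched similarly; since cuts are made on several dimensions at once, the child count of a combination is the product of the per-dimension cuts (names are ours):

```python
import math

def combination_allowed(cuts_per_dim, spfac, rules_at_node):
    """A combination of per-dimension cuts is allowed only if the
    resulting number of child nodes stays within spfac * sqrt(rules)."""
    children = math.prod(cuts_per_dim)
    return children <= spfac * math.sqrt(rules_at_node)

print(combination_allowed([4, 4], 4, 100))  # True:  16 <= 40
print(combination_allowed([8, 8], 4, 100))  # False: 64 > 40
```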

  22. Hardware Accelerator Multidimensional Cutting (HyperCuts) • Region Compaction: a node in the decision tree originally covers the region {[Xmin, Xmax], [Ymin, Ymax]}. However, all the rules associated with the node are covered by the subregion {[X'min, X'max], [Y'min, Y'max]}. With region compaction, the area associated with the node shrinks to the minimum space that covers all of its rules.

  23. Hardware Accelerator Multidimensional Cutting (HyperCuts) • Pushing Common Rule Subsets Upwards: an example in which all the child nodes of A share the same subset of rules {R0, R1}. As a result, only A stores the subset instead of it being replicated in all the children. [Figure: before, the children of A hold {R0, R1, R2}, {R0, R1, R3} and {R0, R1, R4}; after, A holds {R0, R1} and the children hold {R2}, {R3} and {R4}]

  24. Hardware Accelerator Algorithm Changes • Remove the region compaction and pushing-common-rule-subsets-upwards heuristics from the HyperCuts algorithm.

  25. Hardware Accelerator Algorithm Changes • For HiCuts, the number of cuts at an internal node starts at 32 and doubles each time the following condition is met: (Σ rules at each child of i + np ≤ spfac × number of rules at i) AND (np < 129)

  26. Hardware Accelerator Algorithm Changes • All combinations of cuts between the chosen dimensions are considered if they obey the following condition, where spfac can be 1, 2, 3 or 4: (np ≤ 2^(4+spfac)) AND (np ≥ 32)
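The modified condition amounts to a simple range check on the total number of cuts np (a sketch with hypothetical names):

```python
def np_allowed(np_cuts, spfac):
    """Modified HyperCuts rule: 32 <= np <= 2^(4+spfac),
    so spfac = 1..4 caps np at 32, 64, 128 or 256."""
    return 32 <= np_cuts <= 2 ** (4 + spfac)

print(np_allowed(64, 1))  # False: the cap is 32
print(np_allowed(64, 2))  # True:  the cap is 64
```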

  27. Hardware Accelerator Memory Structure • The hardware accelerator uses 7704-bit wide memory words. • In order to calculate which cut the packet should traverse to, each internal node stores 8-bit mask and shift values for each dimension. • The masks indicate how many cuts are made to each dimension, while the shift values indicate each dimension's weight. • The cut to be chosen is calculated by ANDing the mask values with the corresponding 8 most significant bits from each of the packet's 5 dimensions. The resulting value for each dimension is shifted by its shift value, and the results are added together, giving the cut to be selected.
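The mask-and-shift computation can be sketched as follows (we assume right shifts that both align each dimension's selected bits and apply its weight; the function and example values are ours, not from the paper):

```python
def cut_index(msbs, masks, shifts):
    """AND each dimension's 8 MSBs with its mask, shift the result
    by that dimension's shift value, and sum over the dimensions."""
    return sum((v & m) >> s for v, m, s in zip(msbs, masks, shifts))

# Two-dimension example: 4 cuts on dim 0 (top 2 bits), 2 on dim 1 (top bit).
print(cut_index([0b01000000, 0b10000000], [0xC0, 0x80], [5, 7]))  # 3
```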

  28. Hardware Accelerator Memory Structure • Each saved rule uses 160 bits of memory. • The Destination and Source Ports use 32 bits each, with 16 bits for each of the min and max range values. • The Source and Destination IP addresses use 35 bits each, with 32 bits to store the address and 3 bits for the mask. • The storage requirement for the mask has been reduced from 6 to 3 bits by encoding the mask and storing 3 bits of the encoded mask value in the 3 least significant bits of the IP address when the mask length is 0–27. • The protocol number uses 9 bits, with 8 bits to store the number and 1 bit for the mask. • Each 7704-bit memory word can hold up to 48 rules, and it is possible to search all of these rules in parallel in one clock cycle.
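The stated capacity is consistent with the word and rule sizes, as a quick arithmetic check shows (the slide does not say what the bits beyond the listed fields hold):

```python
WORD_BITS = 7704
RULE_BITS = 160  # ports 2x32 + IPs 2x35 + protocol 9, plus remaining bits

rules_per_word = WORD_BITS // RULE_BITS
print(rules_per_word)  # 48 (with 7704 - 48*160 = 24 bits left over)
```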

  29. Outline Introduction Adaptive Clocking Architecture Hardware Accelerator Hierarchical Intelligent Cutting (HiCut) Multidimensional Cutting (HyperCuts) Algorithm Changes Low Power Architecture Performance

  30. Low Power Architecture

  31. Outline Introduction Adaptive Clocking Architecture Hardware Accelerator Hierarchical Intelligent Cutting (HiCut) Multidimensional Cutting (HyperCuts) Algorithm Changes Low Power Architecture Performance

  32. Performance

  33. Performance Power figures for ASIC implementation Power figures for Cyclone 3 implementation

  34. Performance ASIC implementation classifying network traces using rulesets containing 20,000 rules. Cyclone 3 implementation classifying network traces using rulesets containing 20,000 rules.

  35. Conclusion Simulation results show that ASIC and FPGA implementations of our low power architecture can reduce power consumption by 17–88% by adjusting the clock frequency.
