
Low Power Architecture for High Speed Packet Classification




  1. Low Power Architecture for High Speed Packet Classification Author: Alan Kennedy, Xiaojun Wang, Zhen Liu, Bin Liu Publisher: ANCS 2008 Presenter: Chun-Yi Li Date: 2009/05/06

  2. Outline • Introduction • Adaptive Clocking Architecture • Hardware Accelerator • Hierarchical Intelligent Cutting (HiCut) • Multidimensional Cutting (HyperCuts) • Algorithm Changes • Low Power Architecture • Performance

  3. Introduction • Optical Carrier levels describe a range of digital signals that can be carried on a SONET fiber-optic network. • The number in the Optical Carrier level is directly proportional to the data rate of the bitstream carried by the digital signal.

  4. Introduction • Implementing packet classification algorithms in software is not feasible when trying to achieve high speed packet classification. • High throughput algorithms such as RFC are unable to reach OC-768 or even OC-192 line rates when run on devices such as general purpose processors, even for relatively small rulesets.

  5. Introduction • A large percentage of idle time means a large amount of unnecessary dynamic power is consumed due to unnecessary switching of logic and memory elements. [Figure: percentage of time the classifier spends idle when classifying packets from the CENIC trace at different frequencies]

  6. Outline Introduction Adaptive Clocking Architecture Hardware Accelerator Hierarchical Intelligent Cutting (HiCut) Multidimensional Cutting (HyperCuts) Algorithm Changes Low Power Architecture Performance

  7. Adaptive Clocking Architecture • The adaptive clocking unit is designed to run a packet classification hardware accelerator at up to N different frequencies. • For our packet classifier, 32 MHz was found to be fast enough to deal with the worst-case bursts of packets at OC-768 line speed, so Fmax = 32 MHz. The frequency of state i is fi = Fmax / 2^(N−i−1), i = 0, 1, ..., N−1.
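The frequency ladder above can be sketched in a few lines of Python (the function name and MHz units are ours, not from the paper):

```python
# Frequencies f_i = Fmax / 2^(N-1-i): each state runs twice as fast
# as the one below it, topping out at Fmax in the highest state.
F_MAX_MHZ = 32  # fast enough for worst-case OC-768 bursts, per the slide

def frequency_table(n_states):
    """Return the N state frequencies in MHz, lowest state first."""
    return [F_MAX_MHZ / 2 ** (n_states - 1 - i) for i in range(n_states)]

print(frequency_table(4))  # [4.0, 8.0, 16.0, 32.0]
```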

  8. Adaptive Clocking Architecture • The buffer, of size M, is divided among the N states, with each state i having a width Wi, 0 ≤ Wi ≤ M, such that M = Σ_{i=0}^{N−1} Wi. [Figure: buffer of size M partitioned into state widths W0, W1, ..., WN−1]

  9. Adaptive Clocking Architecture • The threshold for determining when a state is exited and the next higher state entered is saved in a register in the adaptive clocking unit and can be changed at any time: Ti = Σ_{j=0}^{i} Wj, i = 0, 1, ..., N−2. [Figure: thresholds T0, T1, ... marked on the buffer at the boundaries between the state widths]
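The relation between state widths and thresholds can be sketched as follows (a minimal sketch under our reading of the slide; names are ours):

```python
def thresholds(widths):
    """T_i = W_0 + ... + W_i for i = 0..N-2; the highest state
    has no exit threshold, so the last width contributes to none."""
    result, total = [], 0
    for w in widths[:-1]:
        total += w
        result.append(total)
    return result

print(thresholds([2, 3, 5]))  # [2, 5]
```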

  10. Adaptive Clocking Architecture • The output clock frequency to the packet classification hardware accelerator starts at the frequency of the lowest used state, f0. • If the threshold T0 for this state is exceeded, the next higher used state S1 is entered and the clock frequency changes to f1. [Figure: chain of states S0 through S9]

  11. Adaptive Clocking Architecture • Only states S4, S7, S8 and S9 are used. • In this case the output clock frequency to the packet classifier will start at f1. [Figure: chain of states S0 through S9 with only S4, S7, S8 and S9 in use]
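One way to sketch the resulting frequency selection (our interpretation: the classifier runs at the frequency of the first used state whose threshold the buffer occupancy does not exceed; all names are hypothetical):

```python
def select_frequency(occupancy, thresholds, freqs):
    """Pick the clock frequency for the current buffer occupancy:
    stay in state i while occupancy <= T_i, otherwise climb to the
    next used state; the top state has no exit threshold."""
    for i, t in enumerate(thresholds):
        if occupancy <= t:
            return freqs[i]
    return freqs[-1]

# With thresholds [2, 5] and frequencies [8, 16, 32] MHz:
print(select_frequency(1, [2, 5], [8, 16, 32]))  # 8
print(select_frequency(7, [2, 5], [8, 16, 32]))  # 32
```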

  12. Outline Introduction Adaptive Clocking Architecture Hardware Accelerator Hierarchical Intelligent Cutting (HiCut) Multidimensional Cutting (HyperCuts) Algorithm Changes Low Power Architecture Performance

  13. Hardware Accelerator Hierarchical Intelligent Cuttings (HiCuts) • The algorithm constructs the decision tree by recursively cutting the hyperspace one dimension at a time into subregions. • The algorithm keeps cutting the hyperspace until no subregion contains more rules than a predetermined number called binth.

  14. Hardware Accelerator Hierarchical Intelligent Cuttings (HiCuts) [Figure: root node cut on Field2 (4 cuts, labelled 00–11), producing children with rule lists such as {R0, R1, R5, R6, R7, R10, R11}; binth = 4]

  15. Hardware Accelerator Hierarchical Intelligent Cuttings (HiCuts) [Figure: the tree after further cuts on Field4 (4 cuts) and Field3 (4 cuts); binth = 4]

  16. Hardware Accelerator Hierarchical Intelligent Cuttings (HiCuts) [Figure: the finished tree after an additional cut on Field5; every leaf now holds at most binth = 4 rules]

  17. Hardware Accelerator Hierarchical Intelligent Cuttings (HiCut)

  18. Hardware Accelerator Hierarchical Intelligent Cuttings (HiCuts) • binth: limits the amount of linear searching at the leaves. • np: the number of cuts. • spfac: a multiplier that limits the storage increase caused by executing cuts at a node: Σ (rules at each child of i) + np ≤ spfac × (number of rules at i)
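A minimal sketch of the spfac space check, in the form used by the original HiCuts heuristic (function and parameter names are ours):

```python
def cut_is_affordable(rules_per_child, spfac, rules_at_node):
    """HiCuts space check: the storage a cut creates (rules replicated
    across the children, plus one pointer per child) must not exceed
    spfac times the number of rules at the node being cut."""
    np_cuts = len(rules_per_child)
    return sum(rules_per_child) + np_cuts <= spfac * rules_at_node

print(cut_is_affordable([2, 2, 2, 2], 4, 4))  # True:  12 <= 16
print(cut_is_affordable([4, 4, 4, 4], 2, 4))  # False: 20 > 8
```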

  19. Hardware Accelerator Multidimensional Cutting (HyperCuts) • The main difference from HiCuts is that HyperCuts recursively cuts the hyperspace into subregions by performing cuts on multiple dimensions at a time.

  20. Hardware Accelerator Multidimensional Cutting (HyperCuts) [Figure: a single node cut on Field1 (2 cuts) and Field5 (2 cuts) at once, giving four children with rule lists such as {R0, R2, R5}; binth = 4]

  21. Hardware Accelerator Multidimensional Cutting (HyperCuts) • spfac: a multiplier that limits the storage increase caused by executing cuts at a node: max child nodes at i ≤ spfac × sqrt(number of rules at i)
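The HyperCuts bound can be sketched similarly; since cuts are made on several dimensions at once, the child count of a combination is the product of the per-dimension cuts (names are ours):

```python
import math

def combination_allowed(cuts_per_dim, spfac, rules_at_node):
    """A combination of per-dimension cuts is allowed only if the
    resulting number of child nodes stays within spfac * sqrt(rules)."""
    children = math.prod(cuts_per_dim)
    return children <= spfac * math.sqrt(rules_at_node)

print(combination_allowed([4, 4], 4, 100))  # True:  16 <= 40
print(combination_allowed([8, 8], 4, 100))  # False: 64 > 40
```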

  22. Hardware Accelerator Multidimensional Cutting (HyperCuts) • Region Compaction: a node in the decision tree originally covers the region {[Xmin, Xmax], [Ymin, Ymax]}. However, all the rules associated with the node are covered by the subregion {[X'min, X'max], [Y'min, Y'max]}. With region compaction, the area associated with the node shrinks to the minimum space that covers all of its rules.

  23. Hardware Accelerator Multidimensional Cutting (HyperCuts) • Pushing Common Rule Subsets Upwards: an example in which all the child nodes of A share the same subset of rules {R0, R1}. As a result, only A stores the subset instead of it being replicated in all the children. [Figure: before, the children of A hold {R0, R1, R2}, {R0, R1, R3} and {R0, R1, R4}; after, A holds {R0, R1} and the children hold {R2}, {R3} and {R4}]

  24. Hardware Accelerator Algorithm Changes • Remove the region compaction and pushing-common-rule-subsets-upwards heuristics from the HyperCuts algorithm.

  25. Hardware Accelerator Algorithm Changes • For HiCuts, the number of cuts at an internal node starts at 32 and doubles each time the following condition is met: (Σ rules at each child of i + np ≤ spfac × number of rules at i) AND (np < 129)

  26. Hardware Accelerator Algorithm Changes • All combinations of cuts between the chosen dimensions are considered if they obey the following condition, where spfac can be 1, 2, 3 or 4: (np ≤ 2^(4+spfac)) AND (np ≥ 32)
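The modified condition amounts to a simple range check on the total number of cuts np (a sketch with hypothetical names):

```python
def np_allowed(np_cuts, spfac):
    """Modified HyperCuts rule: 32 <= np <= 2^(4+spfac),
    so spfac = 1..4 caps np at 32, 64, 128 or 256."""
    return 32 <= np_cuts <= 2 ** (4 + spfac)

print(np_allowed(64, 1))  # False: the cap is 32
print(np_allowed(64, 2))  # True:  the cap is 64
```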

  27. Hardware Accelerator Memory Structure • The hardware accelerator uses 7704-bit wide memory words. • In order to calculate which cut the packet should traverse to, each internal node stores 8-bit mask and shift values for each dimension. • The masks indicate how many cuts are made to each dimension, while the shift values indicate each dimension's weight. • The cut to be chosen is calculated by ANDing the mask values with the corresponding 8 most significant bits from each of the packet's 5 dimensions. The resulting value for each dimension is shifted by its shift value, and the results are added together, giving the cut to be selected.
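The mask-and-shift computation can be sketched as follows (we assume right shifts that both align each dimension's selected bits and apply its weight; the function and example values are ours, not from the paper):

```python
def cut_index(msbs, masks, shifts):
    """AND each dimension's 8 MSBs with its mask, shift the result
    by that dimension's shift value, and sum over the dimensions."""
    return sum((v & m) >> s for v, m, s in zip(msbs, masks, shifts))

# Two-dimension example: 4 cuts on dim 0 (top 2 bits), 2 on dim 1 (top bit).
print(cut_index([0b01000000, 0b10000000], [0xC0, 0x80], [5, 7]))  # 3
```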

  28. Hardware Accelerator Memory Structure • Each saved rule uses 160 bits of memory. • The Destination and Source Ports use 32 bits each, with 16 bits for each of the min and max range values. • The Source and Destination IP addresses use 35 bits each, with 32 bits to store the address and 3 bits for the mask. • The storage requirement for the mask has been reduced from 6 to 3 bits by encoding the mask and storing 3 bits of the encoded mask value in the 3 least significant bits of the IP address when the mask length is 0–27. • The protocol number uses 9 bits, with 8 bits to store the number and 1 bit for the mask. • Each 7704-bit memory word can hold up to 48 rules, and it is possible to search all of these rules in parallel in one clock cycle.
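The stated capacity is consistent with the word and rule sizes, as a quick arithmetic check shows (the slide does not say what the bits beyond the listed fields hold):

```python
WORD_BITS = 7704
RULE_BITS = 160  # ports 2x32 + IPs 2x35 + protocol 9, plus remaining bits

rules_per_word = WORD_BITS // RULE_BITS
print(rules_per_word)  # 48 (with 7704 - 48*160 = 24 bits left over)
```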

  29. Outline Introduction Adaptive Clocking Architecture Hardware Accelerator Hierarchical Intelligent Cutting (HiCut) Multidimensional Cutting (HyperCuts) Algorithm Changes Low Power Architecture Performance

  30. Low Power Architecture

  31. Outline Introduction Adaptive Clocking Architecture Hardware Accelerator Hierarchical Intelligent Cutting (HiCut) Multidimensional Cutting (HyperCuts) Algorithm Changes Low Power Architecture Performance

  32. Performance

  33. Performance Power figures for ASIC implementation Power figures for Cyclone 3 implementation

  34. Performance ASIC implementation classifying network traces using rulesets containing 20,000 rules. Cyclone 3 implementation classifying network traces using rulesets containing 20,000 rules.

  35. Conclusion Simulation results show that ASIC and FPGA implementations of our low power architecture can reduce power consumption by 17–88% by adjusting the clock frequency.
