
TCAM Ternary Content Addressable Memory


Presentation Transcript


  1. TCAM: Ternary Content Addressable Memory. 張燕光 (Yeim-Kuan Chang), Dept. of Computer Science and Information Engineering, National Cheng Kung University

  2. Introduction – TCAM • Content-addressable memories (CAMs) enable a search operation to complete in a single clock cycle. • A TCAM allows a third state, "*" or "don't care," adding flexibility to the search. • The TCAM memory array stores rules in decreasing order of priority, which simplifies the priority encoder.

  3. Introduction – TCAM • An input key is compared against all TCAM entries in parallel. • For IP lookups, each TCAM entry is a prefix; in general, each entry is a ternary string. • The N-bit bit-vector indicates which rules match. • An N-bit priority encoder outputs the address of the matching entry with the highest priority (sketched below). That address is used as an index into an array in SRAM to find the action associated with the prefix.
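To make the search semantics concrete, here is a minimal software model of a TCAM lookup. It is an illustration only: the (value, mask, action) triples and the function name are my own, not part of any slide or device interface. Because entries are stored in decreasing priority order, the priority encoder reduces to returning the first match.

```python
def tcam_search(entries, key):
    """Return (index, action) of the highest-priority matching entry.

    Each entry is (value, mask, action): mask bits set to 1 must match the
    key, mask bits set to 0 are "don't care". Entries sit in decreasing
    priority order, so the first match wins (in hardware, all rows compare
    in parallel and the priority encoder picks the topmost matching row).
    """
    for i, (value, mask, action) in enumerate(entries):
        if (key & mask) == (value & mask):
            return i, action
    return None

# Example: 8-bit entries for prefixes 0101**** (len 4) and 01****** (len 2).
entries = [
    (0b01010000, 0b11110000, "next-hop A"),  # longer prefix first = higher priority
    (0b01000000, 0b11000000, "next-hop B"),
]
print(tcam_search(entries, 0b01011010))  # -> (0, 'next-hop A')
print(tcam_search(entries, 0b01100000))  # -> (1, 'next-hop B')
```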

  4. TCAM IP lookup example. [Figure: the destination IP address is searched against an N-entry TCAM array; the match bit-vector (e.g., 0 1 0 1) feeds a priority encoder whose output memory location indexes an SRAM action memory that returns the next hop.]

  5. Types of CAMs • Binary CAM (BCAM or CAM) stores only 0s and 1s. • Applications: MAC table lookup and Layer 2 security-related VPN segregation. • Ternary CAM (TCAM) stores 0, 1, and *. • Applications: wherever wildcards are needed, such as Layer 3 and 4 classification for QoS and CoS purposes and IP routing (longest prefix matching). • Available sizes: 1 Mb, 2 Mb, 4.7 Mb, 9.4 Mb, 18.8 Mb, 20 Mb, 36 Mb, and 40 Mb. • Search rates of 50, 100, or 360 million searches per second. • CAM entries are structured as multiples of 36-40 bits rather than 32 bits.

  6. CAM cell circuits. [Figure: a 10-T NOR-type CAM cell and a 9-T NAND-type CAM cell, each shown storing D = 0 and D = 1.]

  7. CAM of four 3-bit words • Step 1: Load the input key. • Step 2: Precharge all matchlines (MLs) high (high = match; low = miss). • Step 3: Broadcast the search word onto the differential searchline (SL) pairs. • Step 4: Perform the cell comparisons. • Step 5: The encoder maps the matchline of the matching location to its encoded address.

  8. TCAM cell • Each cell is in one of three logic states: 0, 1, or *. • Two 1-bit 6-T static random-access memory (SRAM) cells (D0/D1) store the three logic states of a TCAM cell. [Figure: a NOR-type TCAM cell shown in states D = 1, D = 0, and D = *.]

  9. TCAM state • Generally, the 0, 1, and * states of a TCAM cell are set by D0 = 0 and D1 = 1, D0 = 1 and D1 = 0, and D0 = 1 and D1 = 1, respectively.

  10. TCAM • Each SRAM cell consists of two cross-coupled inverters plus two access transistors that connect the cell to two bitlines (BLs) and one wordline (WL) for read and write operations. • Each pair of transistors (M1/M3 or M2/M4) forms a pulldown path from the matchline (ML).

  11. TCAM • If any pulldown path connects the ML to ground, the state of the ML becomes 0. • A pulldown path connects the ML to ground when the searchline (SL) and D0 do not match; no pulldown path to ground exists when SL and D0 match. • When the TCAM cell is in the "don't care" state, M3 and M4 prevent searchlines SL and SL' (the complemented searchline), respectively, from being connected to ground, regardless of the search bit. • When the search bit is "don't care" (SL = 0 and SL' = 0), M1 and M2 prevent searchlines SL and SL', respectively, from being connected to ground, regardless of the value stored in the TCAM cell.

  12. Truth table of a TCAM cell (reconstructed from the cell behavior on the preceding slides; match = no pulldown path from ML to ground):

      Search bit:   0       1       *
  Stored 0          match   miss    match
  Stored 1          miss    match   match
  Stored *          match   match   match

  13. Longest prefix match using TCAM • Two search cases: the search key is a prefix, or the search key is an IP address; a global mask (0: disable, 1: enable) selects which TCAM bit positions participate in the search. [Figure: TCAM entries at memory locations 0x000-0x101 stored in decreasing length order, each a ternary string such as 0 1 1 0 X X; the priority encoder maps the matching location into an SRAM table holding next hops Port A-Port D.]

  14. TCAM – Priority and Update • To ensure the LPM result is correct, the prefix-length ordering in the TCAM must always be maintained when updates take place. [Figure: prefix-length ordering in a TCAM of locations 1 through M-1: 32-bit prefixes at the top, then 31-bit, 30-bit, ..., down to 8-bit, with free space at the end.]

  15. TCAM – Pros and Cons • Advantages • Industry vendors keep providing cheaper and faster TCAM products • The TCAM architecture is easy to understand, and TCAM entries are simple to manage during updates • TCAM performance is deterministic • Disadvantages • A TCAM is less dense than a RAM, storing fewer bits in the same chip area • TCAMs dissipate more power than RAM-based solutions • Ranges are not straightforward to store, • e.g., [0, 3] = 00**, but [0, 5] = 00** + 010*

  16. TCAM Cross reference

  17. CoolCAMs • CoolCAM: architectures and algorithms for making TCAM-based routing tables more power-efficient. • TCAM vendors provide a mechanism to reduce power by selectively addressing smaller portions of the TCAM. • The TCAM is divided into a set of blocks; each block is a contiguous, fixed-size chunk of TCAM entries. • E.g., a 512K-entry TCAM could be divided into 64 blocks of 8K entries each. • When a search command is issued, it is possible to specify which block(s) to use in the search. • This saves power, since TCAM power consumption is proportional to the number of entries searched; for example, a lookup that searches only 4 of the 64 blocks activates just 1/16 of the entries.

  18. CoolCAMs • Observation: most prefixes in core routing tables are between 16 and 24 bits long. • Put the very short (< 16-bit) and very long (> 24-bit) prefixes in a set of TCAM blocks that are searched on every lookup. • The remaining prefixes are partitioned into "buckets," one of which is selected by hashing on each lookup. • Each bucket is laid out over one or more TCAM blocks. • The hashing function is restricted to merely using a selected set of input bits as an index.

  19. CoolCAM – bit selection architecture

  20. CoolCAM – bit selection architecture • A route lookup then involves the following: • the hashing function (bit-selection logic, really) selects k hashing bits from the destination address, which identify the bucket to be searched; • the blocks holding the very long and very short prefixes are also searched. • Bit selection cannot promise to avoid the worst-case input, but it gives designers a power budget. • Given such a power budget and a routing table, it is sufficient to find a set of hashing bits that produces a split that does not exceed the power budget (a satisfying split).

  21. CoolCAM – bit selection architecture: 3 heuristics • The first is simple: use the rightmost k bits of the first 16 bits. In almost all routing traces studied, this works well. • The second is a brute-force search that checks all possible subsets of k bits from the first 16; it is guaranteed to find a satisfying split if one exists (a sketch follows). • The third is a greedy algorithm that falls between the simple heuristic and the brute-force one in terms of complexity and accuracy.
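As a concrete illustration of the second heuristic, the sketch below enumerates every choice of k of the first 16 bits and returns one whose largest bucket stays within a given power budget (a satisfying split). All names and the toy data are hypothetical; the CoolCAM paper's actual implementation may differ.

```python
from itertools import combinations

def bucket_sizes(stems, bits):
    """Partition 16-bit prefix stems into buckets keyed by the chosen bits."""
    counts = {}
    for s in stems:
        idx = tuple((s >> (15 - b)) & 1 for b in bits)  # bit 0 = most significant
        counts[idx] = counts.get(idx, 0) + 1
    return counts

def find_satisfying_split(stems, k, budget):
    """Brute force: try every k-subset of the first 16 bits."""
    for bits in combinations(range(16), k):
        if max(bucket_sizes(stems, bits).values()) <= budget:
            return bits          # a satisfying split
    return None                  # no k-bit split meets the budget

# Toy data: 16-bit stems of six prefixes, k = 2 hashing bits, budget = 2.
stems = [0x8000, 0x8001, 0xC000, 0x4000, 0x0001, 0x2000]
print(find_satisfying_split(stems, k=2, budget=2))   # -> (0, 1)
```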

  22. CoolCAM – trie-based partitioning • A partitioning scheme using a routing-trie data structure. • Eliminates two drawbacks of the bit selection architecture: • worst-case bounds on power consumption that do not match power consumption in practice, and • the assumption that most prefixes are 16-24 bits long. • Two trie-based schemes (subtree-split and postorder-split), both involving two steps (they differ only in the mechanism for performing the first-stage lookup): • construct a binary routing trie from the routing table; • partitioning step: carve subtrees out of the trie and place them into buckets.

  23. CoolCAM – bit selection architecture

  24. Trie-based Table Partitioning • Partitioning is based on the binary-trie data structure. • Eliminates drawbacks of the bit selection architecture: • worst-case bounds on power consumption do not match power consumption in practice; • the assumption that most prefixes are 16-24 bits long. • Two trie-based schemes (subtree-split and postorder-split), both involving two steps: • construct a binary trie from the routing table; • partitioning step: carve subtrees out of the trie and place them into buckets. • The two schemes differ in their partitioning step.

  25. Trie-based Architecture • Trie-based forwarding engine architecture • Uses an index TCAM (instead of hashing) to determine which bucket to search • This requires searching the entire index TCAM, but the index TCAM is typically very small

  26. Routing Trie Example. [Figure: a routing table and its corresponding 1-bit trie.]

  27. Splitting into subtrees • Subtree-split algorithm (a sketch follows): • input: b = maximum size of a TCAM bucket • output: a set of K TCAM buckets, each with size in the range [b/2, b] (except possibly the last), and an index TCAM of size K • Partitioning step: post-order traversal of the trie, looking for carving nodes. • Carving node: a node with count ≤ b whose parent's count is > b. • When we find a carving node v: • carve out the subtree rooted at v and place it in a separate bucket; • place the prefix of v in the index TCAM, along with the covering prefix of v; • decrease the counts of all ancestors of v by count(v).
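The following is a simplified Python sketch of this carving, not the paper's exact pseudocode. The trie Node layout is hypothetical, and covering prefixes and index-TCAM bookkeeping are omitted; a child subtree is carved into its own bucket whenever keeping it would push the current node's count above b, which is the post-order carving-node test in recursive form.

```python
class Node:
    def __init__(self, prefix=None, left=None, right=None):
        self.prefix, self.left, self.right = prefix, left, right

def subtree_split(node, b, path="", buckets=None):
    """Post-order carve. Returns (uncarved_prefixes, buckets); the caller
    flushes the final remainder into a last bucket."""
    if buckets is None:
        buckets = []
    if node is None:
        return [], buckets
    rem_l, _ = subtree_split(node.left, b, path + "0", buckets)
    rem_r, _ = subtree_split(node.right, b, path + "1", buckets)
    kept = [node.prefix] if node.prefix is not None else []
    total = len(rem_l) + len(rem_r) + len(kept)
    # While this node's live count exceeds b, carve child subtrees
    # (largest first); each carved path becomes an index-TCAM entry.
    for cpath, crem in sorted([(path + "0", rem_l), (path + "1", rem_r)],
                              key=lambda pc: -len(pc[1])):
        if total > b and crem:
            buckets.append((cpath, crem))
            total -= len(crem)
        else:
            kept += crem
    return kept, buckets

# Tiny example: a trie holding *, 0*, 00*, 01*, 1*, 11* with b = 3.
root = Node("*",
            Node("0*", Node("00*"), Node("01*")),
            Node("1*", None, Node("11*")))
rem, buckets = subtree_split(root, b=3)
if rem:
    buckets.append(("", rem))     # flush the remainder at the root
print(buckets)  # [('0', ['0*', '00*', '01*']), ('', ['*', '1*', '11*'])]
```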

  28-31. Subtree-split: example with b = 4. [Figures: four steps of the post-order carving on the example trie.]

  32. Subtree-split: Remarks • Subtree-split creates buckets whose sizes range from b/2 to b (except the last, which ranges from 1 to b). • At most one covering prefix is added to each bucket. • The total number of buckets created ranges from N/b to 2N/b; each bucket results in one entry in the index TCAM. • Using subtree-split in a TCAM with K buckets, at most K + 2N/K prefixes are searched from the index and data TCAMs during any lookup. • The total complexity of the subtree-split algorithm is O(N + NW/b).
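The K + 2N/K lookup bound also suggests how many buckets to use. A short derivation, added here for illustration (it does not appear on the slide):

```latex
% Minimize the per-lookup search cost f(K) = K + 2N/K over the bucket count K:
f'(K) = 1 - \frac{2N}{K^2} = 0
  \;\Longrightarrow\; K = \sqrt{2N},
  \qquad f\bigl(\sqrt{2N}\bigr) = 2\sqrt{2N}.
```

For example, with N = 512K prefixes this gives K = 1024 buckets and about 2048 entries searched per lookup, versus all 512K entries in an unpartitioned TCAM.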

  33. Post-order splitting • Partitions the table into buckets of exactly b prefixes. • An improvement over subtree-split, where the smallest and largest bucket sizes can vary by a factor of 2. • Cost: more entries in the index TCAM. • Partitioning step: post-order traversal of the trie, looking for subtrees to carve out, but: • buckets are made from collections of subtrees, rather than just a single subtree, • because the trie may not contain N/b subtrees of exactly b prefixes each.

  34. Post-order splitting • postorder-split: does a post-order traversal of the trie, calling carve-exact to carve out subtree collections of size b (a sketch follows). • carve-exact: does the actual carving. • If it is at a node with count = b, it simply carves out that subtree. • If it is at a node with count < b whose parent has count ≤ b, it does nothing (we will later have a chance to carve the parent). • If it is at a node with count x, where x < b, whose parent has count > b, then: • carve out the subtree of size x at this node, and • recursively call carve-exact again, this time looking for a carving of size b - x (instead of b).
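In effect, carve-exact groups the prefixes, taken in post-order, into runs of exactly b. The minimal sketch below captures that behavior, reusing the hypothetical Node layout from the subtree-split sketch; index-TCAM entries and covering prefixes are again omitted.

```python
def postorder_split(root, b):
    """Group the prefixes, visited in post-order, into buckets of exactly b."""
    buckets, current = [], []

    def visit(node):
        if node is None:
            return
        visit(node.left)
        visit(node.right)
        if node.prefix is not None:
            current.append(node.prefix)
            if len(current) == b:        # a full bucket of exactly b prefixes
                buckets.append(list(current))
                current.clear()

    visit(root)
    if current:                          # the last bucket may hold fewer than b
        buckets.append(list(current))
    return buckets

# Reusing the Node class and example trie from the subtree-split sketch:
root = Node("*",
            Node("0*", Node("00*"), Node("01*")),
            Node("1*", None, Node("11*")))
print(postorder_split(root, 4))   # [['00*', '01*', '0*', '11*'], ['1*', '*']]
```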

  35-37. Post-order split: example with b = 4. [Figures: three steps of the post-order carving on the example trie.]

  38. Postorder-split: Remarks • Postorder-split creates buckets of size b (except the last, which ranges from 1 to b). • At most W covering prefixes are added to each bucket, where W is the length of the longest prefix in the table. • The total number of buckets created is exactly N/b; each bucket results in at most W + 1 entries in the index TCAM. • Using postorder-split in a TCAM with K buckets, at most (W + 1)K + N/K + W prefixes are searched from the index and data TCAMs during any lookup. • The total complexity of the postorder-split algorithm is O(N + NW/b).

  39. TCAM update • PLO_OPT (prefix-length ordering) • CAO_OPT (chain-ancestor ordering)

  40. TCAM update (1/9) • Update schemes • Prefix-length ordering constraint (PLO_OPT): two prefixes of the same length need not be in any specific order. • Chain-ancestor ordering constraint (CAO_OPT): there is an ordering constraint between two prefixes if and only if one is a prefix of the other.

  41. TCAM update (2/9) • PLO_OPT • Divide all prefixes into groups by length. • Two prefixes in the same group can be in any order. • Keep all unused entries in the center of the TCAM. • The worst-case number of memory operations per update is L/2, where L is the number of distinct prefix lengths (see the sketch below). [Figure: TCAM locations 1 through 25 holding 32-bit down to 21-bit prefixes, the free space in the center, then 20-bit down to 8-bit prefixes.]
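Below is a toy Python model of PLO_OPT insertion for a prefix that lands above the free space (deletion and the below-the-free-space case are symmetric). The class layout and names are my own illustration of the scheme, not vendor code: the free slot is cascaded upward, one entry move per length group between the free space and the target group, hence at most L/2 moves per update.

```python
class PloTcam:
    """slots[0] holds the longest prefixes; end[l] is one slot past the last
    entry of length group l. Groups run 32, 31, ..., 21 from the top, and
    everything from end[21] onward is free space (overflow is not handled)."""

    def __init__(self, size, lengths=range(32, 20, -1)):
        self.slots = [None] * size
        self.lengths = list(lengths)                 # 32 down to 21
        self.end = {l: 0 for l in self.lengths}

    def insert(self, prefix, length):
        """Cascade the free slot up to group `length`: move each intervening
        group's first entry into the hole, then shift that group's boundary."""
        moves, hole = 0, self.end[self.lengths[-1]]  # first free slot
        for l in sorted(l for l in self.lengths if l < length):
            first = self.end[l + 1]                  # group l's first entry
            self.slots[hole] = self.slots[first]     # move it into the hole
            hole, self.end[l] = first, self.end[l] + 1
            moves += 1
        self.slots[hole] = prefix                    # hole now sits in group `length`
        self.end[length] += 1
        return moves

t = PloTcam(size=16)
t.insert("A/24", 24)            # cascades through the 21-/22-/23-bit boundaries
print(t.insert("B/22", 22))     # -> 1 boundary move
print(t.slots[:3])              # -> ['A/24', 'B/22', None]
```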

  42. TCAM update (3/9) • PLO_OPT insertion and deletion. [Figure: inserting prefix 208.12.63.82/32 and deleting prefix 208.12.82/20; each operation moves one boundary of the free space by 1.]

  43. TCAM update (4/9) • CAO_OPT • The PLO constraint is more restrictive than necessary; it can be relaxed so that only overlapping prefixes are ordered: prefixes on the same chain of the trie must keep their order. [Figure: chains Q1-Q4 of overlapping prefixes arranged around the free space.]

  44. TCAM update (5/9) • CAO_OPT • A logical inverted trie can be superimposed on the prefixes stored in the TCAM. • Only prefixes on the same path have an ordering constraint. • The CAO_OPT algorithm also keeps the empty space in the center of the TCAM. • For every prefix, the longest chain that the prefix belongs to should be split around the empty space as equally as possible. • The worst-case number of memory operations per update is D/2, where D is the maximum length of any chain in the trie. [Figure: a maximal chain split around the free space, with chains Q1-Q4.]

  45. TCAM update (6/9) • Notation • LC(p): the longest chain that prefix p belongs to • len(LC(p)): its length • rootpath(p) • ancestors of p • children of p • hcld(p) • HCN(p) [Figure: a chain in the TCAM with the ancestors of p above the free space, and LC(p), hcld(p), and the children of p below.]

  46. TCAM update (7/9) • CAO_OPT insertion • Case 1: the prefix q to be inserted is above the free space. • Case 2: the prefix q to be inserted is below the free space. [Figure: in each case, prefixes on LC(q) or on HCN(q) are moved to open a slot, and q is inserted there.]

  47. TCAM update (8/9) • CAO_OPT deletion • Deletion works on the chain that has prefix p adjacent to the free space. [Figure: deleting prefix q next to the free space.]

  48. TCAM update (9/9) • Shortcomings of the above schemes for IPv6 • PLO: • IPv6 allows 128 different prefix lengths, so the worst-case and average-case cost of shifting prefixes stored in the TCAM grows sharply. • CAO: • CAO must maintain an auxiliary trie structure in SRAM; each update costs O(L) time to modify the data stored in the trie. • To reorder a chain, the router must stall lookups during the prefix update; the more memory accesses an update needs, the higher the packet drop rate.

  49. Introduction – Rule Table

  50. TCAM range encoding • Direct range-to-prefix conversion is the traditional database-independent scheme (a sketch follows). • The primary advantage of database-independent schemes is their fast update operations. • However, database-independent schemes suffer from large TCAM memory consumption.
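A sketch of direct range-to-prefix conversion: split an integer range [lo, hi] into the minimal set of prefixes by repeatedly peeling off the largest aligned block that fits. The function names and the bit width are illustrative, not from the slides.

```python
def range_to_prefixes(lo, hi, width=4):
    """Split [lo, hi] into the minimal set of (value, prefix_len) blocks."""
    out = []
    while lo <= hi:
        size = lo & -lo if lo else 1 << width    # largest block aligned at lo
        while size > hi - lo + 1:                # shrink until it fits in range
            size >>= 1
        out.append((lo, width - size.bit_length() + 1))
        lo += size
    return out

def as_ternary(value, plen, width=4):
    """Render a (value, prefix_len) pair as a ternary string with * padding."""
    bits = format(value, "0{}b".format(width))
    return bits[:plen] + "*" * (width - plen)

# The slide's example over 4 bits: [0, 5] -> 00** + 010*
print([as_ternary(v, l) for v, l in range_to_prefixes(0, 5)])  # ['00**', '010*']
```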
