Content Addressable Memories Vahid Tabatabaee Fall 2007
References • Panos C. Lekkas, Network Processors: Architectures, Protocols, and Platforms, McGraw-Hill. • Kostas Pagiamtzis, Ali Sheikholeslami, “Content-Addressable Memory (CAM) Circuits and Architectures: A Tutorial and Survey,” IEEE J. of Solid-State Circuits, vol. 41, no. 3, March 2006. • NetLogic MicroSystems Application Note, “Intradevice Configuration of Network Search Engines”. • NetLogic MicroSystems Application Note, “High Performance Layer 3 Forwarding”. • IDT White Paper, “Taking Packet Processing to the Next Level”.
Classification and Search Engines • The classification engine receives streams of packets as its input. • It continuously applies a set of application-specific sorting rules and policies to the packets. • It ends up compiling a series of new parallel packet streams in queues of packets. • For classification the NP should consult a memory bank, a lookup table, or even a database where the rules are stored. • Search engines are used to consult a lookup table or a database of rules and policies for the correct classification. Search engines are mostly based on associative memory, which is also known as CAM.
What is CAM? • Content Addressable Memory is a special kind of memory! • Read operation in traditional memory: • Input is the address of the content that we are interested in. • Output is the content stored at that address. • In CAM it is the reverse: • Input is the content (or part of it) that we are looking for in the memory. • Output is the location where the matching content is stored.
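The contrast between the two read operations can be sketched in a few lines of Python. This is only a software illustration; the four stored words and the function names `ram_read` and `cam_search` are made up for the example, and a real CAM compares all locations in parallel rather than in a loop.

```python
# Hypothetical 4-word memory; the contents are illustrative only.
MEMORY = [0x1A, 0x2B, 0x3C, 0x4D]

def ram_read(address):
    """Traditional memory: address in, content out."""
    return MEMORY[address]

def cam_search(content):
    """CAM: content in, address of the matching word out (None if absent).
    The loop stands in for a comparison that hardware does in parallel."""
    for address, stored in enumerate(MEMORY):
        if stored == content:
            return address
    return None
```

For example, `ram_read(2)` returns the word stored at address 2, while `cam_search` applied to that word returns the address 2.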
CAM for Routing Table Implementation • CAM can be used as a search engine. • We want to find matching contents in a database or table. • Example: routing table. Source: http://pagiamtzis.com/cam/camintro.html
Simplified CAM Block Diagram • The input to the system is the search word. • The search word is broadcast on the search lines. • Each match line indicates whether there was a match between the search word and the stored word. • The encoder specifies the match location. • If there are multiple matches, a priority encoder selects the first match. • A hit signal flags the case in which there is no match. • The search word is long, ranging from 36 to 144 bits. • Table size ranges from a few hundred to 32K entries. • Address space: 7 to 15 bits. Source: K. Pagiamtzis, A. Sheikholeslami, “Content-Addressable Memory (CAM) Circuits and Architectures: A Tutorial and Survey,” IEEE J. of Solid-State Circuits, March 2006
CAM Memory Size • Largest available is around 18 Mbit (single chip). • Rule of thumb: the largest CAM chip is about half the size of the largest available SRAM chip. • A typical CAM cell consists of two SRAM cells. • CAM size has been growing at an exponential rate. Source: K. Pagiamtzis, A. Sheikholeslami, “Content-Addressable Memory (CAM) Circuits and Architectures: A Tutorial and Survey,” IEEE J. of Solid-State Circuits, March 2006
CAM Basics • The search-data word is loaded into the search-data register. • All match lines are pre-charged to high (temporary match state). • Search-line drivers broadcast the search word onto the differential search lines. • Each CAM core cell compares its stored bit against the bit on the corresponding search lines. • Match lines of words that have at least one mismatching bit discharge to ground. Source: K. Pagiamtzis, A. Sheikholeslami, “Content-Addressable Memory (CAM) Circuits and Architectures: A Tutorial and Survey,” IEEE J. of Solid-State Circuits, March 2006
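The search sequence above (pre-charge, broadcast, per-bit compare, discharge, then priority encoding) can be mimicked in software. The sketch below is a behavioral model only, not a circuit description; the names `bcam_search` and `priority_encode` and the 4-bit example words are assumptions made for illustration.

```python
def bcam_search(stored_words, search_word, width):
    """Return one match-line value per stored word (1 = match, 0 = mismatch)."""
    match_lines = [1] * len(stored_words)       # pre-charge: every line starts as a match
    for row, stored in enumerate(stored_words):
        for bit in range(width):                # each core cell compares one bit
            if ((stored >> bit) & 1) != ((search_word >> bit) & 1):
                match_lines[row] = 0            # any mismatching bit discharges the line
                break
    return match_lines

def priority_encode(match_lines):
    """Return the lowest matching address, or None when no line stayed high."""
    for address, line in enumerate(match_lines):
        if line:
            return address
    return None
```

With the stored words `[0b1010, 0b0110, 0b1010]` and the search word `0b1010`, the first and third match lines stay high and the priority encoder reports address 0.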
Types of CAMs • Binary CAM (BCAM) stores only 0s and 1s. • Applications: MAC table consultation, layer 2 security-related VPN segregation. • Ternary CAM (TCAM) stores 0s, 1s, and don’t cares. • Applications: cases where we need wildcards, such as layer 3 and 4 classification for QoS and CoS purposes, and IP routing (longest prefix matching). • Available sizes: 1Mb, 2Mb, 4.7Mb, 9.4Mb, and 18.8Mb. • CAM entries are structured as multiples of 36 bits rather than 32 bits.
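A ternary entry can be modeled as a value plus a mask whose zero bits are don’t cares. The sketch below uses that model for longest prefix matching, assuming (as is commonly done, though not stated in the slides) that entries are stored with longer prefixes at lower addresses so that the first match is the longest one. The helper names, the routes, and the port labels are hypothetical.

```python
def ip(a, b, c, d):
    """Pack a dotted-quad IPv4 address into a 32-bit integer."""
    return (a << 24) | (b << 16) | (c << 8) | d

def prefix_mask(length):
    """Mask with `length` leading 1s; the remaining bits are don't cares."""
    return (0xFFFFFFFF << (32 - length)) & 0xFFFFFFFF

# Hypothetical table, ordered from longest to shortest prefix.
TABLE = [
    (ip(10, 1, 2, 0), 24, "port C"),
    (ip(10, 1, 0, 0), 16, "port B"),
    (ip(10, 0, 0, 0), 8, "port A"),
]

def tcam_lookup(table, address):
    """Return the result of the first (highest-priority) matching entry."""
    for value, length, result in table:
        mask = prefix_mask(length)
        if (address & mask) == (value & mask):
            return result
    return None
```

Here `tcam_lookup(TABLE, ip(10, 1, 2, 7))` matches all three entries, and the ordering makes the /24 entry win.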
CAM Advantages • They associate the input (comparand) with their memory contents in one clock cycle. • They are configurable in multiple formats of width and depth of search data, which allows searches to be conducted in parallel. • CAMs can be cascaded to increase the size of the lookup tables that they can store. • We can add new entries to their table, so they can learn what they did not know before. • They are one of the appropriate solutions for higher speeds.
CAM Disadvantages • They cost several hundred dollars per CAM, even in large quantities. • They occupy a relatively large footprint on a card. • They consume excessive power. • Generic system engineering problems: • Interface with the network processor. • Simultaneous table update and lookup requests.
CAM structure • The comparand bus is 72 bytes wide bidirectional. • The result bus is output. • Command bus enables instructions to be loaded to the CAM. • It has 8 configurable banks of memory. • The NPU issues a command to the CAM. • CAM then performs exact match or uses wildcard characters to extract relevant information. • There are two sets of mask registers inside the CAM.
CAM structure • There are global mask registers, which can remove specific bits from the comparison, and a mask register that is present in each location of memory. • The search result can be • one output (highest priority) • a burst of successive results. • The output port is 24 bytes wide. • Flag and control signals specify the status of the banks of the memory. • They also enable us to cascade multiple chips.
CAM Features • CAM Cascading: • We can cascade up to 8 pieces without incurring a performance penalty in search time (72 bits x 512K). • We can cascade up to 32 pieces with performance degradation (72 bits x 2M). • Terminology: • Initializing the CAM: writing the table into the memory. • Learning: updating specific table entries. • Writing a search key to the CAM: search operation. • Handling wider keys: • Most CAMs support 72 bit keys. • They can support wider keys in native hardware. • Shorter keys: can be handled more efficiently at the system level.
CAM Latency • Clock rate is between 66 and 133 MHz. • The clock speed determines the maximum search capacity. • Factors affecting the search performance: • Key size • Table size • For the system designer, the total latency to retrieve data from the SRAM connected to the CAM is what matters. • By using pipelining and multi-threading techniques for resource allocation we can ease the CAM speed requirements. Source: IDT
Packet Search Speed Requirements Source: IDT article in CommsDesign: http://www.commsdesign.com/showArticle.jhtml?articleID=16501972
Management of Tables Inside a CAM • It is important to squeeze as much information as we can into a CAM. • Example from NetLogic application notes: • We want to store 4 tables of 32 bit wide IP destination addresses. • The CAM is 128 bits wide. • If we store one address per slot, 96 bits of every slot are wasted. • We can arrange the 32 bit wide tables next to each other. • Every 128 bit slot is partitioned into four 32 bit slots. • These are the 3rd, 2nd, 1st, and 0th tables going from left to right. • We use the global mask register to access only one of the tables.
Example Continued • We can still use the per-entry mask register (not the global mask register) to do longest prefix matching.
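The packing scheme in this example can be sketched as follows, with each 128-bit slot held as a Python integer and table 0 in the rightmost 32 bits, as on the slide. This is a minimal software model under those assumptions; the function names are hypothetical and the per-entry masks used for prefix matching are left out to keep the sketch short.

```python
FIELD_BITS = 32          # width of one table entry
FIELD_MASK = 0xFFFFFFFF

def pack_slot(fields):
    """Pack four 32-bit fields into one 128-bit slot; fields[0] is table 0 (rightmost)."""
    slot = 0
    for table_index, value in enumerate(fields):
        slot |= (value & FIELD_MASK) << (FIELD_BITS * table_index)
    return slot

def global_mask(table_index):
    """Global mask that exposes only the 32 bits belonging to one table."""
    return FIELD_MASK << (FIELD_BITS * table_index)

def search_table(slots, table_index, key):
    """Search one of the four side-by-side tables; return the slot address or None."""
    mask = global_mask(table_index)
    comparand = (key & FIELD_MASK) << (FIELD_BITS * table_index)
    for address, slot in enumerate(slots):
        if (slot & mask) == comparand:
            return address
    return None
```

A key that is present in table 2 is found only when table 2 is selected; the same key searched in table 1 is not matched against the table 2 field because the global mask hides it.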
Table Aggregation • We can use tag bits to aggregate multiple tables in a single CAM. • Example: • We want to use a single CAM (NL85721) for IPv4 packet classification and forwarding. • We want to filter packets based on other parameters such as VPN. • Without tag bits we can get an undesired match when we do a classification. • CAM word 0 does not match, but the destination address matches CAM word 1. Source: http://www.netlogicmicro.com/pdf/ncs12_rev_0_8.pdf
Tag bits to avoid undesired matches • Tag bits can be used to differentiate between tables. • Tag bits should not be masked. • For packet classification the tag bit is 0, and for packet forwarding it is 1. Source: http://www.netlogicmicro.com/pdf/ncs12_rev_0_8.pdf
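A small model shows why an unmasked tag bit prevents cross-table matches. Here the tag is placed just above a 32-bit field, and every entry mask always includes the tag bit. The layout, the helper names, and the two example entries are assumptions for illustration and are not taken from the NetLogic note.

```python
FIELD_BITS = 32
FIELD_MASK = 0xFFFFFFFF
TAG_CLASSIFICATION = 0   # tag value used for the classification table
TAG_FORWARDING = 1       # tag value used for the forwarding table

def make_word(tag, value):
    """Place the tag bit just above the 32-bit field."""
    return (tag << FIELD_BITS) | (value & FIELD_MASK)

def make_mask(field_mask):
    """Entry mask; the tag bit is always compared (never masked)."""
    return (1 << FIELD_BITS) | (field_mask & FIELD_MASK)

# Hypothetical aggregated CAM: word 0 is a classification rule, word 1 a route.
ENTRIES = [
    (make_word(TAG_CLASSIFICATION, 0x0A000000), make_mask(0xFF000000)),
    (make_word(TAG_FORWARDING, 0x0A010000), make_mask(0xFFFF0000)),
]

def search(entries, tag, key):
    """Return the address of the first entry matching both tag and key."""
    comparand = make_word(tag, key)
    for address, (word, mask) in enumerate(entries):
        if (comparand & mask) == (word & mask):
            return address
    return None
```

Without the tag, a forwarding search for `0x0A010203` would hit word 0 first, since the classification rule also covers that address; with the tag set to 1 the search skips word 0 and returns word 1.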
Vertically Oriented Table Aggregation • We can use validity bits to support multiple tables with different numbers of entries. • We need one validity bit for each table. • When the validity bit in a slot is 1, the corresponding table has a valid entry in that slot. • In the comparand register, only the validity bit of the table that is being searched should be 1. Source: http://www.netlogicmicro.com/pdf/ncs12_rev_0_8.pdf
System Design Issues (multiple searches) • For deep packet inspection, several searches must occur simultaneously. • For example: MAC table, IP table, rules table, flow-management table. • Question: Do we use 4 CAMs or just 1 CAM with 4 partitions? • If we use only 1 CAM: • Some tables are very large and some are small. • This approach wastes expensive partitions. • If we use 4 CAMs: • The approach suffers when smaller tables do not justify using separate CAMs. • The overall cost also increases since we have to use separate SRAMs too.
System Design Issues (shorter and longer search keys) • We showed how we can implement 36 bit search tables in a 72 bit wide CAM. • This approach halves the speed, since we need to search twice for each key. • Some CAMs are hardwired to support both 36 and 72 bit wide search keys, but they are more expensive. • For longer search keys there are two choices: • We can use a double data rate (DDR) bus and load meaningful bits at both the rising and falling edges of the clock. • We can double the clock frequency of the bus that loads the comparands.
System Design Issues (simultaneous update and search) • A CAM location cannot be updated while a search is in progress. • While we update, packets cannot be forwarded and they are backlogged. • We can have a backup CAM for updates while searches are done on the other CAM. • Some designs offer a third port for table maintenance without inhibiting search operations (SiberCore is an example). • This increases pin count, board real estate, and the number of signals to be routed on the board.
System Design Issues (CIDR table update) • Recall that CIDR works based on the longest prefix match (LPM). • CAM segments are created based on the prefix length. • Some empty slots are left in each segment to accommodate new entries. • If a segment is suddenly filled up, the table must be taken offline to reshuffle the entries. • A read and write operation is needed for each entry that must be relocated. We may need a read and write for the mask word too. Source: http://www.netlogicmicro.com/pdf/cidr_white_paper.pdf
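The cost of making room in a full segment can be estimated with a simple model. The sketch below assumes a chain-shifting scheme in which a free slot is borrowed from the nearest segment that still has one and each segment in between gives up one boundary entry, so the move count equals the segment distance. That scheme, the function name `moves_needed`, and the use of one segment per prefix length are assumptions; the white paper may organize the reshuffle differently.

```python
def moves_needed(free_slots, target):
    """Estimate entry moves needed to insert into segment `target`.

    free_slots[i] is the number of empty slots left in segment i, where
    segments are ordered by prefix length. Under the assumed chain-shifting
    scheme, one entry moves for every segment boundary crossed.
    """
    if free_slots[target] > 0:
        return 0                                  # room available, no reshuffle
    distances = [abs(i - target) for i, free in enumerate(free_slots) if free > 0]
    if not distances:
        raise OverflowError("CAM table is full")
    return min(distances)
```

With 32 segments and the only free slot at the far end from the target, this model gives 31 moves.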
CIDR table update: worst case analysis • Worst case scenario: all segments but one are full. • A new entry may need up to 31 move operations. • Each move requires 4 clock cycles, for a total of 4 x 31 = 124 clock cycles. • With 3000 routing updates per second: 3000 x 124 = 372,000 clock cycles per second. • If the NP clock rate is 100 MHz, the cycle time is 10 nsec. • Time consumed by updates in each second: 372,000 cycles x 10 nsec per cycle = 3.72 msec. • At OC-192 rate, we have around 20 to 30 MPPS. • Therefore, 74,400 to 111,600 packets per second will not be classified and should be discarded.
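The arithmetic on this slide can be checked with a short script. All constants come from the slide; the function and variable names are made up for the example.

```python
MOVES_PER_UPDATE = 31        # worst-case moves for one new entry
CYCLES_PER_MOVE = 4          # clock cycles per move
UPDATES_PER_SECOND = 3000    # routing updates per second
CLOCK_HZ = 100_000_000       # 100 MHz NP clock, i.e. 10 nsec per cycle

def update_cycles_per_second():
    """Clock cycles spent on table updates in each second."""
    return MOVES_PER_UPDATE * CYCLES_PER_MOVE * UPDATES_PER_SECOND

def update_time_ms():
    """Milliseconds per second during which the table is unavailable."""
    return update_cycles_per_second() * 1000 / CLOCK_HZ

def packets_affected(packets_per_second):
    """Packets arriving while the table is being updated, per second."""
    return update_cycles_per_second() * packets_per_second // CLOCK_HZ

for rate in (20_000_000, 30_000_000):
    print(rate, packets_affected(rate))
```

The two affected-packet counts follow directly from multiplying the 3.72 msec of update time by the 20 and 30 MPPS arrival rates.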
Reproaches against CAM based search engines (POWER) • There is a misconception that the power consumption of CAMs keeps increasing! • It does not make sense to compare the power consumption of a 2Mb CAM clocked at 66 MHz and capable of 66 Msps with that of a 9Mb CAM clocked at 150 MHz and capable of 125 Msps. • Power consumption is the result of multiple factors such as: • Semiconductor manufacturing process. • Number of searches per second. • Storage density. • The smaller the process, the larger the capacity; a smaller process also allows a lower supply voltage and a higher clock rate. • A 0.18μ process uses 50% less power than 0.25μ, with a further 30% improvement at 0.15μ. • The absolute power consumption is increasing, because of: • Larger tables. • Wider search keys for deep packet classification. • Increased wire speed. • Make sure to consider worst case scenarios, not the data sheet values.
Reproaches against CAM based search engines • Table maintenance and management is a software-related problem. • The third port (Synchronous Maintenance Interface [SMI]) of SiberCore CAMs is an interesting way of doing table maintenance without affecting the ongoing search processes. • Sort-free CAMs do not need partitioning. • Density and footprint (not a real issue), example: • The three members in the family, the CYNSE10512, 10256, and 10128, provide address tables of 512k, 256k, and 128k entries (18 Mbits, 9 Mbits, and 4.5 Mbits), respectively. • All three devices are housed in 388-contact BGA packages. • Price: $75, $135, $275 • A 1,000,000 entry IPv4 table can be handled in two 18 Mbit CAMs.
Reproaches against CAM based search engines • Inflexibility with Table Configurations: • This is a real issue. • Some applications need flexible table sizes and widths. • More research and development is needed. • Price • In absolute terms they are expensive. • They are sophisticated, complex products that are indispensable in most designs. So they should be expensive!