CONTENT-ADDRESSABLE MEMORY
Moshe Lopian, Michael Sberro, Ron Zarsky
Content-Addressable Memory: Introduction
Content-addressable memories (CAMs) are hardware search engines that are much faster than algorithmic approaches for search-intensive applications. CAMs are composed of conventional semiconductor memory (usually SRAM) with added comparison circuitry that enables a search operation to complete in a single clock cycle.
Applications of CAM
• Address lookup function in Internet routers: IP (Internet Protocol) routing
• Cache memory in microprocessors
• Translation lookaside buffers (TLBs)
• Pattern recognition in AI and robotic systems
• Real-time compression and encryption
• Data compression applications
• Mapping tables and translation buffers in microprocessors
• Database accelerators
• Neural networks
We will give an extensive explanation of the advantages of CAM using the example of the address lookup function in Internet routers, and then give a short description of the use of CAM in some of the other applications.
CAM Application: Router Address Lookup Internet routers forward data packets from an incoming port using an address lookup function. The address lookup function examines the packet's destination address and chooses an output port associated with that address. The router's list of destination addresses and their corresponding output ports is called the routing table.
The Problem
The routing parameters that determine the complexity of the implementation are:
• Entry size
• Table size
• Search rate
• Table update rate
Routing table sizes are about 30,000 entries but are growing rapidly. Routers must perform hundreds of millions of searches per second, in addition to thousands of routing table updates per second.
All four entries in the table are 5-bit words, with the don't-care bit X. Because of the X bits, the first three entries in Table 1 represent a range of input addresses; e.g., the entry on Line 1 indicates that all addresses in the range 10100–10111 are forwarded to port A. The router searches for the destination address of each incoming packet in the address lookup table to find the appropriate output port. For example, if the router receives a packet with the incoming address 01101, the address lookup matches both Line 2 and Line 3 in the table. Line 2 is selected since it has the most defined bits, indicating it is the most direct route to the destination. This lookup style is called longest-prefix matching and is required to implement the most recent Internet Protocol (IP) networking standard.
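The longest-prefix selection described above can be sketched in software. Note that the routing table below is partly hypothetical: only the 101XX entry forwarding to port A is stated in the text, and the remaining entries (and their port names) are chosen so that Lines 2 and 3 both match 01101, as in the example.

```python
# Software sketch of longest-prefix matching over a ternary routing table.
# The table contents are hypothetical, chosen to be consistent with the
# example in the text (Line 1 covers 10100-10111 and forwards to port A;
# Lines 2 and 3 both match 01101, and Line 2 wins with more defined bits).

TABLE = [
    ("101XX", "A"),  # Line 1 (stated in the text)
    ("0110X", "B"),  # Line 2 (hypothetical port name)
    ("011XX", "C"),  # Line 3 (hypothetical port name)
    ("10011", "D"),  # Line 4 (hypothetical)
]

def matches(entry, address):
    """A ternary entry matches if every non-X bit equals the address bit."""
    return all(e == "X" or e == a for e, a in zip(entry, address))

def lookup(address):
    """Return the output port of the matching entry with the most defined bits."""
    candidates = [(e, port) for e, port in TABLE if matches(e, address)]
    if not candidates:
        return None
    # Longest-prefix match: prefer the entry with the fewest X (don't care) bits.
    entry, port = min(candidates, key=lambda c: c[0].count("X"))
    return port

print(lookup("01101"))  # both 0110X and 011XX match; 0110X wins -> B
```

A hardware ternary CAM performs all of these comparisons in parallel; the loop here only models the selection rule.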
The Solution There are many software-based methods to implement the address lookup function, although not all can meet the above requirements. For example, software-based binary searching accomplishes the task if the lookup table is ordered. Binary searching has O(log n) time complexity in addition to the extra time required to insert a new entry in the table. Almost all algorithmic approaches are too slow to keep up with projected routing requirements. In contrast, CAMs use hardware to complete a search in a single cycle, resulting in constant O(1) time complexity.
The Solution (cont.)
This is accomplished by adding comparison circuitry to every cell of hardware memory. The result is a fast, massively parallel lookup engine. The strength of CAMs over algorithmic approaches is their high search throughput.
Disadvantage of CAM The current bottleneck is the large power consumption due to the large amount of comparison circuitry activated in parallel. Reducing the power consumption is a key aim of current CAM research.
Content-Addressable Memory
There are two basic forms of CAM:
• Binary CAMs support storage and searching of binary bits: zero or one (0, 1).
• Ternary CAMs support storing zero, one, or a don't-care bit (0, 1, X).
Ternary CAMs are presently the dominant CAM type, since longest-prefix routing is the Internet standard.
Figure 1 shows a block diagram of a simplified 4 x 5-bit ternary CAM.
The CAM contains the routing table from Table 1 to illustrate how a CAM implements address lookup. The CAM core cells are arranged into four horizontal words, each five bits long. Core cells contain both storage and comparison circuitry. The search lines run vertically in the figure and broadcast the search data to the CAM cells.
The Matchlines
The matchlines run horizontally across the array and indicate whether the search data matches the row's word. An activated matchline indicates a match and a deactivated matchline indicates a non-match, called a mismatch in the CAM literature. The matchlines are inputs to an encoder that generates the address corresponding to the match location.
CAM Search Operation
Stage 1: A CAM search operation begins with precharging all matchlines high, putting them all temporarily in the match state.
Stage 2: The search line drivers broadcast the search data, 01101 in the figure, onto the search lines.
Stage 3: Each CAM core cell compares its stored bit against the bit on its corresponding search line. Cells with matching data do not affect the matchline, but cells with a mismatch pull down the matchline. Cells storing an X operate as if a match has occurred. The result is that the matchline is pulled down for any word that has at least one mismatch.
All other matchlines remain activated (precharged high). In the figure, the two middle matchlines remain activated, indicating a match, while the other matchlines discharge to ground, indicating a mismatch. Last, the encoder generates the search address location of the matching data. In the example, the encoder selects the numerically smallest of the two activated matchlines, generating the match address 01.
This match address is used as the input address to a RAM that contains a list of output ports as depicted in Figure 2. This CAM/RAM system is a complete implementation of an address lookup engine. The match address output of the CAM is in fact a pointer used to retrieve associated data from the RAM. In this case the associated data is the output port.
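The three search stages and the CAM/RAM pairing can be summarized in a minimal behavioral model. The stored words and port assignments below are hypothetical, chosen to reproduce the described example: searching for 01101 activates the two middle matchlines, and the encoder returns match address 01.

```python
# Behavioral model of the CAM search operation and the CAM/RAM lookup engine.
# The stored words and RAM contents are hypothetical, chosen so that searching
# for 01101 activates the two middle matchlines and yields match address 01,
# as in the figure described in the text.

CAM_WORDS = ["101XX", "0110X", "011XX", "10011"]   # hypothetical table
RAM_PORTS = ["A", "B", "C", "D"]                   # hypothetical associated data

def search(search_data):
    # Stage 1: precharge every matchline high (all temporarily in the match state).
    matchlines = [True] * len(CAM_WORDS)
    # Stages 2-3: broadcast the search data; any mismatching cell pulls its
    # matchline down. Cells storing X behave as if a match occurred.
    for row, word in enumerate(CAM_WORDS):
        for stored_bit, search_bit in zip(word, search_data):
            if stored_bit != "X" and stored_bit != search_bit:
                matchlines[row] = False   # one mismatch discharges the line
                break
    return matchlines

def encode(matchlines):
    """Priority encoder: return the smallest-numbered activated matchline."""
    for row, active in enumerate(matchlines):
        if active:
            return row
    return None

ml = search("01101")
addr = encode(ml)                 # match address, a pointer into the RAM
print(ml, addr, RAM_PORTS[addr])  # rows 1 and 2 match; address 1 -> port B
```

In hardware all rows are compared in the same cycle; the outer loop here only models that single-cycle parallel comparison.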
Cache Memory in Microprocessors
The problem:
• Despite improvements in technology, microprocessors are still much faster than main memory.
• Memory access time is increasingly the bottleneck in overall application performance.
• As a result, an application might spend a considerable amount of time waiting for data.
• This not only negatively impacts overall performance, but the application also cannot benefit much from a processor clock-speed upgrade.
Solution: One way to overcome this problem is to insert a small high-speed buffer memory between the processor and main memory. Such a buffer is generally referred to as cache memory, or cache for short. In our case, instead of translating the data from the processor into an address associated with a row in the cache, the data from the CPU is matched against a row in the CAM, which is associated with the desired row in the cache. The lookup is therefore quicker.
Translation Lookaside Buffers (TLB)
• The translation lookaside buffer (TLB) is a table in the processor that contains cross-references between the virtual and real addresses of recently referenced pages of memory.
• It functions like a "hot list" or quick-lookup index of the pages in main memory that have been most recently accessed.
Other Applications using CAM The problem: When a cache miss occurs, data must be fetched from an address in virtual memory. This virtual memory address must be translated into a main memory real address. If the real-memory address of a desired page is not in the TLB, a further delay is incurred while the real address is determined.
Other Applications using CAM
The solution: Determine the real-memory address of a desired page using a CAM. This cuts down the time needed to translate the real-memory address of a desired page, so the data is retrieved faster.
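As a sketch, a CAM-based TLB lookup might look as follows. All page numbers, frame numbers, and page-table contents below are hypothetical; the point is only that a hit replaces the slow page-table walk with a single associative lookup.

```python
# Sketch of a CAM-based TLB: virtual page numbers are stored on the CAM side,
# and a hit returns the physical frame from the associated RAM side in a single
# lookup instead of a multi-step page-table walk. All entries are hypothetical.

TLB_TAGS   = [0x1A, 0x2B, 0x3C, 0x4D]   # virtual page numbers (CAM side)
TLB_FRAMES = [0x80, 0x81, 0x82, 0x83]   # physical frame numbers (RAM side)

# Hypothetical full page table, used only on a TLB miss.
PAGE_TABLE = {0x1A: 0x80, 0x2B: 0x81, 0x3C: 0x82, 0x4D: 0x83, 0x5E: 0x84}

def translate(virtual_page):
    # In hardware all CAM entries are compared in parallel; this loop only
    # models that single-cycle comparison.
    for i, tag in enumerate(TLB_TAGS):
        if tag == virtual_page:
            return TLB_FRAMES[i], "hit"
    # TLB miss: fall back to the (much slower) page-table walk.
    return PAGE_TABLE[virtual_page], "miss"

print(translate(0x3C))  # page 0x3C is in the TLB: frame 0x82, a hit
print(translate(0x5E))  # page 0x5E is not in the TLB: page-table walk, a miss
```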
The Priority Encoder
• The priority encoder passes a priority token from the bit with the highest priority to the bit with the lowest priority.
• The function of the ith output bit of the PE is EPi = Di*Pi, where Di stands for the corresponding input data and Pi stands for the priority token passed into this bit.
• There is an iterative relationship between Pi and Pi-1, namely Pi = ~Di-1*Pi-1, and hence the general expression for the output is EPi = Di*~Di-1*~Di-2*...*~D0.
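The iterative relations above can be checked with a short sketch, taking bit 0 as the highest priority with P0 = 1:

```python
# Direct implementation of the priority-encoder equations from the text:
# P0 = 1, Pi = ~D(i-1) * P(i-1), EPi = Di * Pi,
# which expands to EPi = Di * ~D(i-1) * ... * ~D0 (bit 0 has highest priority).

def priority_encode(d):
    """d is a list of 0/1 input bits Di; returns the one-hot output list EPi."""
    ep = []
    p = 1                      # priority token enters at the highest-priority bit
    for di in d:
        ep.append(di & p)      # EPi = Di * Pi
        p &= 1 - di            # P(i+1) = ~Di * Pi: the token passes only past zeros
    return ep

print(priority_encode([0, 1, 1, 0]))  # -> [0, 1, 0, 0]: bit 1 wins over bit 2
```

At most one output bit is ever high, since the token is consumed by the first asserted input.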
Priority Encoder – cont. • When constructing a large PE, macros are used. For example, for a 64 bit PE we use 8bit macros. The 8bit macro itself consists of two 4bit sub-macros. • The 4bit sub-macros receive a second-level lookahead signal called LA2. • The 8bit macros receive a third level lookahead signal called LA3. • For the ith 8bit macro the functions are LA3i = D8i+7+D8i+6+D8i+5+D8i+4+D8i+3+D8i+2+D8i+1+D8i LA2i = ~(D8i+3+D8i+2+D8i+1+D8i+LA3i-1)
The 8bit macro • The 8 output functions for the ith 8bit macro are expressed as EP8i = D8i*~LA3i-1 EP8i+1 = D8i+1*~ D8i*~LA3i-1 EP8i+2 = D8i+2*~D8i+1*~ D8i*~LA3i-1 EP8i+3 = D8i+3*~D8i+2*~D8i+1*~ D8i*~LA3i-1 EP8i+4 = D8i+4*LA2i EP8i+5 = D8i+5*~D8i+4*LA2i EP8i+6 = D8i+6*~D8i+5*~D8i+4*LA2i EP8i+7 = D8i+7*~D8i+6*~D8i+5*~D8i+4*LA2i
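The macro equations above can be cross-checked against the flat definition EPi = Di*~Di-1*...*~D0. The following is a behavioral sketch, not the transistor-level design; LA3i-1 is the lookahead input from the previous macro (taken as 0 for the first macro):

```python
# Behavioral sketch of the 8bit macro equations from the text. la3_prev plays
# the role of LA3(i-1), the lookahead from all higher-priority macros (0 when
# this is the first macro). The result is checked exhaustively against the
# flat product-of-complements definition of the priority encoder.

def macro8(d, la3_prev=0):
    """d: the 8 input bits D(8i)..D(8i+7); la3_prev: LA3 of the previous macro."""
    la3 = int(any(d))                                    # LA3i = D(8i+7)+...+D(8i)
    la2 = 1 - (d[0] | d[1] | d[2] | d[3] | la3_prev)     # LA2i per the text
    not_prev = 1 - la3_prev
    ep, blk = [0] * 8, 1
    for j in range(4):                  # lower half, gated by ~LA3(i-1)
        ep[j] = d[j] & blk & not_prev   # e.g. EP(8i+1) = D(8i+1)*~D(8i)*~LA3(i-1)
        blk &= 1 - d[j]
    blk = 1
    for j in range(4, 8):               # upper half, gated by LA2i
        ep[j] = d[j] & blk & la2        # e.g. EP(8i+4) = D(8i+4)*LA2i
        blk &= 1 - d[j]
    return ep, la3

def flat_pe(d):
    """Flat reference: EPi = Di * ~D(i-1) * ... * ~D0."""
    ep, p = [], 1
    for di in d:
        ep.append(di & p)
        p &= 1 - di
    return ep

# Exhaustively compare the macro against the flat encoder for all 256 inputs.
for n in range(256):
    d = [(n >> j) & 1 for j in range(8)]
    assert macro8(d)[0] == flat_pe(d)
print("8bit macro matches the flat encoder on all 256 inputs")
```

Note how LA2i already folds in ~D(8i+3)*...*~D(8i)*~LA3(i-1), which is why the upper four outputs need only the local blocking terms.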
Design of the 8bit Macro
• We will show a High-Speed Low-Power (HSLP) design for the 8bit macro.
• Low-power consumption is achieved by two means:
1. The use of transistors instead of logic gates when implementing functions. For example, the function D8i+3+D8i+2+D8i+1+D8i+LA3i-1 is implemented by connecting the drains of 5 transistors, rather than by using a 5-input OR gate.
2. In a non-LP system, when the clock is low, for every i such that Di is low, EPi goes high. When the clock is raised, for every i such that Di is low, EPi is pulled down, and only for j (j != i) such that Dj is high does EPj remain high. The high switching probability results in high power dissipation. In LP systems, when the clock is low, every EPi remains low; when the clock is raised, only one output bit changes its logic level. Hence, there is a significant reduction in switching activity. We note that there is a conceptual similarity between this case and the considerations regarding the Pmos-dominated design of the CAM core cell.
• The high-speed characteristic is achieved through the use of a multilevel lookahead structure: a level does not have to operate if a higher-priority level does.
4bit Sub-Macro Simulation
The EP0 output is high whenever the corresponding input (D0) is high and the clock is high too.
4bit Sub-Macro Simulation (cont.)
This time, the output EP1 is high depending on the input D1, the clock, AND on the fact that no higher-priority input is turned on!
4bit Sub-Macro – Simulating LA2
Why do we get these invalid values instead of zero?
4bit Sub-Macro – Simulating LA2 (cont.)
Why did we get these invalid values? The following equation must hold:
• The width is equal for all Nmos transistors, but their lengths must be summed. We must also account for the distances between the transistors. We calculated that the total length should be 20Ln; hence we get the relation 0.25*Wn = Wp. Only with this relation do the invalid values disappear.
8bit Macro Simulation
Important note: the pulse widths in the following simulations look very narrow, but the units are microseconds (not nanoseconds). The reason is that we wanted to cover all 256 cases (since we have 8 inputs), so in order to show them all in the same graph, the widths are relatively narrow. This is only for presentation purposes, not functional purposes.
Plots of the first 4 outputs (EP0 to EP3) as a function of all 8 inputs (D0 to D7)
8bit Macro Simulation (cont.)
Plots of the last 4 outputs (EP4 to EP7) as a function of all 8 inputs (D0 to D7)
Longer Priority Encoders
• Consider a 64bit PE. We will connect eight 8bit macros.
• Since the critical path has increased dramatically, a severe problem occurs and invalid values appear.
64bit Simulation
Invalid values: when D4 is high, no matter what the value of any higher input (e.g. D23) is, the corresponding output (EP23) should be low, but here we see undesired peaks!
64bit Simulation – Analysis
• Our solution: decrease the rise time and fall time of the clock, increase its high-level period and decrease its low-level period. This is not a good solution, though, since it is hard to create such asymmetric signals. In T-Spice, instead of:
CLK Gnd pulse (0 5 0 .5n .5n 10n 20n)
we have:
CLK Gnd pulse (5 0 0 .1n .1n 2n 10n)
• But there is a better solution, independent of rise times, fall times and pulse widths. It is called folding.
64bit Simulation – cont.
The undesired peaks disappeared after changing the clock.
Folding Technique in Long Priority Encoders • In the previous configuration, the LAout was connected to the neighboring 8bit macro. Actually, this output can be connected to all the other macros with lower priority. Such extra connections can shorten the critical path and hence reduce the propagation delay. This technique is called first-level folding. In our encoder we will use only the first-level folding, but other levels of folding exist too.
Realization of a 64bit PE using the first-level folding technique: a design as a 2-dimensional array with folding. Macro0 is connected to Macro1 and Macro3; Macro2 is connected to Macro3 and Macro5; Macro4 is connected to Macro5 and Macro7. The rest of the macros are connected regularly: the ith macro is connected to the (i+1)th macro.
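Folding changes only the wiring and delay of the lookahead signals, not their logic: the effective LA3 input of a macro is the OR of the LA3 outputs of all higher-priority macros, whether it arrives through a chain or through the extra folding connections. A behavioral sketch of the 64bit PE built from the 8bit macro equations, checked against a flat encoder:

```python
# Logical model of the 64bit PE built from eight 8bit macros. Folding shortens
# the physical critical path but does not change the logic: each macro sees an
# effective LA3(i-1) equal to the OR of the LA3 outputs of all previous macros.

import random

def macro8(d, la3_prev=0):
    """The 8bit macro equations from the text (behavioral, not transistor-level)."""
    la3 = int(any(d))                                # LA3i = D(8i+7)+...+D(8i)
    la2 = 1 - (d[0] | d[1] | d[2] | d[3] | la3_prev) # LA2i per the text
    not_prev = 1 - la3_prev
    ep, blk = [0] * 8, 1
    for j in range(4):                  # lower half, gated by ~LA3(i-1)
        ep[j] = d[j] & blk & not_prev
        blk &= 1 - d[j]
    blk = 1
    for j in range(4, 8):               # upper half, gated by LA2i
        ep[j] = d[j] & blk & la2
        blk &= 1 - d[j]
    return ep, la3

def pe64(d):
    ep, blocked = [], 0      # 'blocked' = OR of LA3 of all higher-priority macros
    for i in range(8):
        out, la3 = macro8(d[8 * i:8 * i + 8], blocked)
        ep += out
        blocked |= la3       # folding broadcasts this to the later macros
    return ep

def flat_pe(d):
    """Flat reference: EPi = Di * ~D(i-1) * ... * ~D0."""
    ep, p = [], 1
    for di in d:
        ep.append(di & p)
        p &= 1 - di
    return ep

random.seed(0)
for _ in range(1000):
    d = [random.randint(0, 1) for _ in range(64)]
    assert pe64(d) == flat_pe(d)
print("64bit folded-lookahead model matches the flat encoder")
```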
Simulation of 64bit with folding The peaks disappeared due to the use of folding
At the end of the line: an 8x3 ROM
Every input Ri is connected to an NMOS gate. In this configuration, the 3-bit words in the memory encode the address itself. Actually, the ROM functions as a decoder in our design, but the contents of the memory may be an arbitrary value corresponding to Ri.