
Smart Memory


Presentation Transcript


  1. Smart Memory • Bill Lynch • Sailesh Kumar • and Team

  2. Overview • High Performance Packet Processing Challenges • Solution – Smart Memory • Smart Memory Architecture

  3. Packet Processing Workload Challenges
  • Sequential memory references: for lookups (L2, L3, L4, and L7) and finite-automata traversal
  • Read-modify-write: statistics, counters, token buckets, mutexes, etc.
  • Pointer and linked-list management: buffer management, packet queues, etc.
  • Traditional implementations use commodity memory to store data, with NPs and ASICs to process the data in memory
  Tons of memory references and minimal compute
  • Performance barriers: memory and chip I/O bandwidth, memory latency, locks for atomic access
  [Diagram: processors (P) connected to banks of commodity memory]
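The "sequential memory references" point can be made concrete with a small sketch. The trie layout and the `build_trie`/`lookup` helpers below are illustrative, not from the slides; the point is that each trie level is a *dependent* read — the next address is unknown until the previous read returns, so the reads cannot be overlapped.

```python
# Illustrative sketch (not the slides' data structure): a binary trie for
# longest-prefix match, stored as a flat list to mimic a memory array.

def build_trie(prefixes):
    """Build a binary trie over (prefix_bits, next_hop) pairs."""
    memory = [{"children": [None, None], "next_hop": None}]  # node 0 = root
    for bits, hop in prefixes:
        node = 0
        for b in bits:
            child = memory[node]["children"][int(b)]
            if child is None:
                memory.append({"children": [None, None], "next_hop": None})
                child = len(memory) - 1
                memory[node]["children"][int(b)] = child
            node = child
        memory[node]["next_hop"] = hop
    return memory

def lookup(memory, addr_bits):
    """Longest-prefix match; returns (next_hop, memory_reads).
    Each trie level costs one dependent memory round trip."""
    node, reads, best = 0, 0, None
    for b in addr_bits:
        reads += 1                      # one round trip per level
        entry = memory[node]
        if entry["next_hop"] is not None:
            best = entry["next_hop"]
        nxt = entry["children"][int(b)]
        if nxt is None:
            return best, reads
        node = nxt
    reads += 1                          # read the final node
    final = memory[node]["next_hop"]
    return (final if final is not None else best), reads
```

A lookup over an n-bit prefix costs up to n+1 serialized round trips, which is exactly why interconnect and memory latency dominate.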

  4. Illustration of Performance Barrier I
  [Diagram: an IP lookup tree stored across memory banks; processors (P) reach it through an interconnection network]
  • Requires several transactions between memory and processors
  • High lookup delay due to high interconnect and memory latency
  • Low IPC, so more processors are needed, which adds more latency in the interconnect

  5. Illustration of Performance Barrier II
  Enqueue: lock free-list → get free node → unlock free-list → lock list tail → read list tail → link free node → update list tail → unlock list tail
  Counter: lock counter → read counter → write counter → unlock counter
  • Lookups are read-only, so relatively easy
  • Linked lists, counters, policers, etc. are read-modify-write
  • These require a per-memory-address lock in multi-core systems
  • Locks are often kept in memory, which requires another transaction and adds significant latency
  • Single-queue or single-counter operations are extremely slow
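The counter sequence on this slide can be sketched as code to show the transaction count. The `RemoteMemory` class below is a hypothetical model (not the slides' hardware) that simply counts round trips; locking, reading, and writing each cross the interconnect separately.

```python
# Hypothetical model: every lock, read, and write against commodity
# memory is a separate round trip across the interconnect.

class RemoteMemory:
    def __init__(self):
        self.words = {}
        self.locks = set()
        self.transactions = 0   # interconnect round trips

    def lock(self, addr):
        self.transactions += 1
        assert addr not in self.locks, "lock already held"
        self.locks.add(addr)

    def unlock(self, addr):
        self.transactions += 1
        self.locks.discard(addr)

    def read(self, addr):
        self.transactions += 1
        return self.words.get(addr, 0)

    def write(self, addr, value):
        self.transactions += 1
        self.words[addr] = value

def increment_counter(mem, addr):
    """The slide's sequence: lock counter -> read -> write -> unlock.
    Four serialized round trips for one increment."""
    mem.lock(addr)
    v = mem.read(addr)
    mem.write(addr, v + 1)
    mem.unlock(addr)
```

While the lock is held, every other core that wants the same counter stalls, which is why a single hot counter is so slow.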

  6. Overview • High Performance Packet Processing Challenges • Memory bandwidth and latency • Limited I/O bandwidth • Solution – Smart Memory • Attach simple compute with data • Attach lock with data • Enable local memory communication • Smart Memory Architecture • Hybrid memory – eDRAM + DDR3-DRAM • Serial I/O

  7. Introduction to Smart Memory
  • What is the real problem? Compute occurs far away from data; lock acquire/release occurs far from data
  • Solution: make memory smarter by keeping compute close to data, managing locks close to data, and enabling local memory-to-memory communication
  Fortunately, the compute in packet-processing jobs is very modest!
  [Diagram: a compute block attached to each memory bank, with processors (P) on the other side of the interconnection network]
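Continuing the counter example, the Smart Memory idea can be sketched as a single command sent to the memory, which performs the read-modify-write (and any locking) internally, next to the data. The `SmartMemory` class and its `command` interface are assumptions for illustration, not the actual device interface.

```python
# Illustrative sketch of the Smart Memory idea: one command, one round
# trip; the atomic read-modify-write happens inside the memory.

class SmartMemory:
    def __init__(self):
        self.words = {}
        self.transactions = 0   # processor <-> memory round trips

    def command(self, op, addr, value=1):
        self.transactions += 1              # a single round trip per command
        if op == "incr":                    # read-modify-write done locally,
            self.words[addr] = self.words.get(addr, 0) + value
            return self.words[addr]         # atomically, next to the data
        if op == "read":
            return self.words.get(addr, 0)
        raise ValueError(op)
```

One transaction replaces the four of the lock/read/write/unlock sequence, and no lock is ever exposed to the processors, so a hot counter no longer serializes the whole system on interconnect latency.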

  8. Introduction to Smart Memory (continued)
  • Smart Memory advantages (get more out of fewer transactions!)
  • Lower I/O bandwidth
  • Lower processing latency
  • Higher IPC
  • Significantly higher single-counter/queue performance

  9. Overview • High Performance Packet Processing Challenges • Memory bandwidth and latency • Limited I/O bandwidth • Solution – Smart Memory • Attach simple compute with data • Attach lock with data • Enable local memory communication • Smart Memory Architecture • Hybrid memory – eDRAM + DDR3-DRAM • Serial chip I/O

  10. Smart Memory Capacity and Bandwidth @ 100G
  [Chart: memory bandwidth (billion accesses/packet, 0.15 to 40) versus memory capacity (2 MB to 512+ MB), placing workloads: Statistics/Counter, Layer-2 forwarding, Queuing/Scheduling, FIB (algorithmic), ACL (algorithmic), DPI (string), DPI (regex), Basic Layer 2, Packet Buffer, Video Buffer]

  11. Smart Memory Capacity and Bandwidth @ 100G
  • Smart Memory uses intelligent algorithms to split the data structures across 64 banks of eDRAM and 8 channels of DDR3-DRAM
  [Chart: the same bandwidth-versus-capacity chart, annotated with the 64 eDRAM banks and the 8 DDR3-DRAM channels]

  12. Smart Memory High Level Architecture
  • Packet processor complex: processors (P) attached to a global interconnect
  • Global interconnect: provides fair communication between processors and Smart Memory
  • Smart Memory complex: eDRAM blocks, each paired with an SM engine, backed by DDR3 DRAM
  • Local interconnect: provides local communication between Smart Memory blocks
  [Diagram: processors above the global interconnect; below it, an array of eDRAM + SM engine blocks joined by a local interconnect, with DDR3 DRAM behind them]

  13. Smart Memory High Level Architecture: FIB Lookup
  • The processor sends the FIB key to an SM engine; the engine reads eDRAM and DRAM locally and returns the result
  • Tables are split between eDRAM and DRAM
  • Computation occurs close to memory, reducing latency
  • Requires fewer memory transactions
  [Diagram: a FIB key flows from a processor to an SM engine, which performs local reads against eDRAM and DRAM and returns the result]
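The FIB-lookup flow on this slide can be sketched end to end. The `SMEngine` and `Processor` classes below are a hypothetical model: the engine holds a prefix table split between an "eDRAM" part and a "DRAM" part and walks it with local reads, while the processor pays only one interconnect transaction for the whole lookup.

```python
# Hypothetical model of slide 13: ship the key to the memory once;
# the SM engine does the longest-prefix match with local reads.

class SMEngine:
    def __init__(self, edram_table, dram_table):
        self.edram = edram_table     # e.g. short, hot prefixes
        self.dram = dram_table       # e.g. longer, colder prefixes
        self.local_reads = 0         # cheap accesses inside the memory

    def fib_lookup(self, key_bits):
        """Longest-prefix match done entirely inside the memory complex."""
        best = None
        for i in range(1, len(key_bits) + 1):
            prefix = key_bits[:i]
            table = self.edram if prefix in self.edram else self.dram
            self.local_reads += 1            # local, low-latency access
            hop = table.get(prefix)
            if hop is not None:
                best = hop
        return best

class Processor:
    def __init__(self, engine):
        self.engine = engine
        self.transactions = 0        # interconnect round trips

    def lookup(self, key_bits):
        self.transactions += 1       # one key out, one result back
        return self.engine.fib_lookup(key_bits)
```

Compared with the earlier trie sketch, the per-level reads still happen, but they happen beside the data at eDRAM latency; the expensive processor-to-memory path is crossed exactly once.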

  14. I/O Technology Choice in Smart Memory
  • Smart Memory already reduces chip I/O bandwidth significantly; how can it be optimized further?
  • The gap between on-chip bandwidth and memory I/O bandwidth is growing: on-chip bandwidth is much higher than memory I/O
  • Smart Memory uses serial I/O
  • 4X the throughput of RLDRAM and QDR
  • 3X fewer pins than DDR3 and DDR4
  • 2.5X reduction in I/O power
  (Based on MoSys data)

  15. High Speed Line Card with Smart Memory
            Traditional Line Card    Line Card with SM
  Power     540+ W                   212 W
  Area      472+ cm2                 148 cm2
  Cost      $5600+                   $2520
  (2-3 times improvement)

  16. Concluding Remarks
  • Packet processing bottlenecks: data far from compute; I/O and memory bandwidth
  • Smart Memory: keeps compute close to data, keeps locking close to data, provides an inter-memory interconnect
  • Advantages: reduced chip I/O bandwidth; high performance and low latency; feature-rich, flexible, and programmable; lower cost (one chip for several functions)

  17. Thank You www.huawei.com
