
cheap silicon: myth or reality?


Presentation Transcript


  1. cheap silicon: myth or reality? Picking the right data plane hardware for software defined networking. Gergely Pongrácz, László Molnár, Zoltán Lajos Kis, Zoltán Turányi (TrafficLab, Ericsson Research, Budapest, Hungary)

  2. DP chip landscape: the usual way of thinking
 The chips are usually placed on a programmability vs. performance chart (assuming the same use case and table sizes):
 • generic NP, run-to-completion (lower performance): SNP, Netronome
 • programmable pipeline: NP4
 • fixed pipeline (higher performance): Fulcrum, Broadcom/Marvell
 The main question that is seldom asked: how big is the difference?

  3. first comparison
 Power per 10G of capacity:
 • switches: ~0.5 W / 10G
 • programmable pipelines: ~3-4 W / 10G
 • NPUs: ~4-5 W / 10G
 • CPUs: ~25 W / 10G
 So it seems there is a 5-10x difference between "cheap silicon" and programmable devices. But do we compare apples to apples?

  4. the PBB scenario: modelling summary

  5. Simple NP/CPU model
 A generic NP/CPU is modelled as a set of building blocks connected by an internal bus:
 • I/O: Ethernet (e.g., 10G, 40G), Fabric (e.g., Interlaken), System (e.g., PCIe)
 • Processing unit(s) (e.g., pipeline, execution units)
 • On-chip memory (e.g., cache, scratchpad)
 • Accelerators (e.g., RE engines, TCAM, HW queues, encryption)
 • Optional accelerators (e.g., TCAM)
 • External resource control (e.g., optional TCAM, external memory)
 • External memory (e.g., DDR3)
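
 As a quick illustration of how such a resource model can be written down, here is a minimal C sketch; the np_model type and all of its field names are our own shorthand for the blocks above, not anything defined in the paper.

```c
/* Minimal sketch of the generic NP/CPU resource model above.
 * The type and field names are illustrative only. */
typedef struct {
    unsigned cores;            /* number of processing units                 */
    double   core_hz;          /* clock frequency of one core                */
    double   l1_bytes_per_clk; /* per-core L1 bandwidth (bytes per clock)    */
    double   l2_tps;           /* shared on-chip L2, transactions per second */
    double   ext_mem_tps;      /* external memory, transactions per second   */
    double   io_gbps;          /* aggregate Ethernet I/O capacity (Gbit/s)   */
} np_model;
```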

  6. The same model instantiated with example figures (blocks connected by the internal bus):
 • external ports: 96 x 10G; Fabric and System ports as above
 • processing: 256 cores @ 1 GHz
 • L1: SRAM, 4 B/clock per core, >128 B
 • L2: eDRAM, 24 Gtps, shared, >2 MB
 • external memory: low-latency RAM (e.g., RLDRAM) and high-capacity memory (e.g., DDR3), 8 MCT, 340 Mtps, >1 GB
 • optional accelerators: none
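
 Continuing the hypothetical np_model sketch from slide 5, the example figures above would be filled in roughly like this (assuming the 8 memory channels can be used in parallel for the aggregate external-memory rate):

```c
/* Example device from the slide: 96x10G ports, 256 cores @ 1 GHz,
 * shared eDRAM L2 at 24 Gtps, 8 memory channels at 340 Mtps each. */
static const np_model example_npu = {
    .cores            = 256,
    .core_hz          = 1e9,
    .l1_bytes_per_clk = 4,
    .l2_tps           = 24e9,
    .ext_mem_tps      = 8 * 340e6,   /* = 2.72e9 transactions/s */
    .io_gbps          = 96 * 10.0,
};
```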

  7. packet walkthrough
 • read frame from I/O: copy to L2 memory, copy header to L1 memory
 • parse fixed header fields
 • find extended VLAN {sport, S-VID → eVLAN}: table in L2 memory
 • MAC lookup and learning {eVLAN, C-DMAC → B-DMAC, dport, flags}: table in ext. memory
 • encapsulate: create new header, fill in values (no further lookup)
 • send frame to I/O
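
 The real implementation is NPU microcode (listed in the paper); purely as an illustration, the steps above could look roughly like the following C sketch. Every type, helper and table stand-in here (pbb_forward, evlan_lookup, mac_lookup_and_learn, PBB_HDR_LEN) is invented for readability, and the dummy lookup functions stand in for tables that actually live in L2 and external memory.

```c
#include <stdint.h>
#include <string.h>

typedef struct { uint8_t b_dmac[6]; uint16_t dport; uint32_t flags; } mac_entry_t;

/* stand-in for the eVLAN table kept in L2 memory */
static uint32_t evlan_lookup(uint16_t sport, uint16_t s_vid)
{
    return ((uint32_t)sport << 16) | s_vid;          /* dummy mapping */
}

/* stand-in for the learning MAC table kept in external memory */
static mac_entry_t mac_lookup_and_learn(uint32_t evlan, const uint8_t *c_dmac)
{
    (void)evlan; (void)c_dmac;
    mac_entry_t e = { {0}, 0, 0 };                   /* dummy entry */
    return e;
}

#define PBB_HDR_LEN 22   /* B-DMAC + B-SMAC + B-TAG + I-TAG, assuming 802.1ah */

/* frame arrives on sport from I/O; the encapsulated frame is written to out */
void pbb_forward(uint16_t sport, const uint8_t *frame, size_t len, uint8_t *out)
{
    /* read frame: payload would go to L2, header fields to per-core L1 */
    uint8_t c_dmac[6];
    memcpy(c_dmac, frame, 6);

    /* parse fixed header fields: S-VID from the S-TAG TCI at offset 14,
       assuming an 802.1ad S-tagged customer frame */
    uint16_t s_vid = (uint16_t)(((frame[14] << 8) | frame[15]) & 0x0FFF);

    uint32_t    evlan = evlan_lookup(sport, s_vid);            /* eVLAN lookup */
    mac_entry_t ent   = mac_lookup_and_learn(evlan, c_dmac);   /* MAC lookup   */

    /* encapsulate: new backbone header in front of the original frame
       (B-TAG / I-TAG field values omitted) */
    memcpy(out, ent.b_dmac, 6);
    memcpy(out + PBB_HDR_LEN, frame, len);

    /* send frame to I/O on ent.dport (not modelled here) */
    (void)ent.dport;
}
```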

  8. assembly code
 • Don't worry, no time for this :)
 • but the code pieces can be found in the paper

  9. pps/bw calculation: summary only*
 • PBB processing costs on average: 104 clock cycles, 25 L2 operations (depends on packet size), 1 external RAM operation
 • Calculated performance (pps) at packet size = 750 B → 960 Mpps = 5760 Gbps
 • cores + L1 memory: 2462 Mpps
 • L2 memory: 960 Mpps (the bottleneck)
 • ext. memory: 2720 Mpps
 • at packet size = 64 B the bottleneck moves to the cores → 2462 Mpps = 1260 Gbps
 * the assembly code and the detailed calculation are available in the paper
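
 As a sanity check of the arithmetic, the headline numbers can be reproduced with a few lines of C; the constants are the example-device figures from slide 6 and the per-packet costs above, while the variable names are ours.

```c
#include <stdio.h>

int main(void)
{
    /* per-packet cost of the PBB use case (750 B packets), from the slide */
    const double cycles_per_pkt  = 104.0;   /* core cycles          */
    const double l2_ops_per_pkt  = 25.0;    /* L2 transactions      */
    const double ext_ops_per_pkt = 1.0;     /* external RAM lookups */

    /* per-resource packet rates in Mpps for the example device */
    double core_mpps = 256 * 1e9  / cycles_per_pkt  / 1e6;  /* ~2462 Mpps */
    double l2_mpps   = 24e9       / l2_ops_per_pkt  / 1e6;  /*   960 Mpps */
    double ext_mpps  = 8 * 340e6  / ext_ops_per_pkt / 1e6;  /*  2720 Mpps */

    /* the slowest resource is the bottleneck */
    double mpps = core_mpps;
    if (l2_mpps  < mpps) mpps = l2_mpps;
    if (ext_mpps < mpps) mpps = ext_mpps;

    printf("cores %.0f, L2 %.0f, ext.mem %.0f Mpps\n",
           core_mpps, l2_mpps, ext_mpps);
    printf("750 B packets: %.0f Mpps = %.0f Gbps\n",
           mpps, mpps * 750 * 8 / 1e3);     /* 960 Mpps = 5760 Gbps */
    return 0;
}
```

 At 64 B the L2 transaction count per packet drops, so the core term becomes the minimum, which is where the 2462 Mpps = 1260 Gbps figure on the slide comes from.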

  10. Ethernet PBB scenario: overview of results
 • The results are theoretical: programmable chips today are designed for more complex tasks with fewer I/O ports
 • 13-16 vs. 10-13 Mpps/Watt, i.e. a 20-30% advantage: around 1.25x instead of 10x

  11. summary and next steps I'll have to make this really quick if I've already spent more than 8 minutes :)

  12. what we’ve learnedso far… • Performance depends mainly on the use case, not on the selected hardware solution • not valid for Intel-like generic CPU – much lower perf. at simple use cases • but even this might change with manycore Intel products (e.g. Xeon Phi) • on a board/card level local processor also counts – known problem for NP4 • Future memory technologies (e.g. HMC, HBM, 3D) might change the picture again • much higher transaction rate, low power consumption

  13. But! no free lunch – the hard part: I/O balance
 • So far it seems that a programmable NPU would be suitable for all tasks
 • BUT! For which use case shall we balance the I/O against the processing complex?
 • today we have (mostly) static I/O built together with the NPU
 • and there is a >10x packet processing performance difference between important use cases
 • How to solve it?
 • different NPU – I/O flavours: still a quite static solution
 • but an (almost) always oversubscribed I/O could do the job
 • I/O – forwarding separation: modular HW

  14. what is next: ongoing and planned activities
 • Prove by prototyping
 • use the ongoing OpenFlow prototyping activity: an OF switch can be configured to act as PBB
 • SNP hardware will be available in our lab in 2013 Q4
 • the Intel (DPDK) version is ready, first results will be demonstrated @ EWSDN 13
 • Evaluate the model and make it more accurate
 • more accurate memory and processor models, e.g. calculate with utilization-based power consumption
 • identify other possible bottlenecks, e.g. backplane, on-chip network

  15. thank you! And let’s discuss these further
