
Block Design Review: Substrate Decap and IPv4 Parse




  1. Block Design Review: Substrate Decap and IPv4 Parse • Brandon Heller • bdh4@cec.wustl.edu • http://www.arl.wustl.edu/projects/techX

  2. Revision History • 9/26/06 (BDH): • Released • 9/28/06 (BDH): • SD now at 5Gbps+

  3. Contents • [Figure: IPv4 MR functional blocks: Rx, Substr Decap, Parse, Lookup, Header Format, QM, Tx] • figure taken from PlanetLab_Design.ppt • For SD and Parse: • overview • block diagram • memory usage • code locations • test procedures • Performance analysis • Unexpected interactions • Future work

  4. Substrate Decap

  5. Substrate Decap • [Figure: IPv4 MR functional blocks: Rx, Substr Decap, Parse, Lookup, Header Format, QM, Tx] • figure taken from PlanetLab_Design.ppt • Main functions: • validate and consume the Ethernet header • look up code_option and slice_data_ptr based on the VLAN tag • validate and consume the substrate UDP/IP headers • pass relevant fields to IPv4 Parse • Single code path • Next-neighbor (NN) ring communication • Uses 8 threads • Renamed from Demux

  6. IPv4 MR Functional Blocks • slide taken from PlanetLab_Design.ppt • [Figure: functional blocks Rx, Substr Decap, Parse, Lookup, Header Format, QM, Tx, annotated with ring data and packet formats] • Ring data into SD: Buf Handle (32b) | Eth. Frame Len (16b), Reserved (8b), Port (8b) • Ring data from SD to Parse: Buf Handle (32b) | MN Frm Length (16b), MN Frm Offset (16b) | Slice ID (VLAN) (16b), Rx UDP DPort (16b) | Rx IP SAddr (32b) | Rx UDP SPort (16b), Reserved (12b), Code (4b) | Slice Data Ptr (32b) • Ethernet header: DstAddr (6B), SrcAddr (6B), Type=802.1Q (2B), VLAN (2B), Type=IP (2B) • IP header: Ver/HLen/ToS/Len (4B), ID/Flags/FragOff (4B), TTL (1B), Protocol=UDP (1B), Hdr Cksum (2B), Src Addr (4B), Dst Addr (4B), IP Options (0-40B) • UDP header: Src Port (2B), Dst Port (2B), UDP length (2B), UDP checksum (2B) • UDP payload (MN packet) • Ethernet trailer: PAD (nB), CRC (4B)

  7. Ethernet Validation • No alignment necessary • Counters kept in non-VLAN-specific region • Tests for • invalid Ethernet packet length • non-VLAN tag protocol ID • non-locally-addressed packet • unrecognized VLAN
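The Ethernet checks above can be sketched in plain host C. This is an illustration, not the Microengine C in substrate_decap.c. Several parts are assumptions:
- the length bounds (64 to 1522 bytes, frame length including CRC);
- the byte offsets, taken from the header diagram on slide 6;
- the names sd_validate_eth and sd_eth_stats_sketch_t, which are hypothetical.

The unrecognized-VLAN test needs the VLAN table, so it is left out here.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bounds for an 802.1Q-tagged frame, CRC included. */
#define ETH_MIN_FRAME_LEN 64u
#define ETH_MAX_FRAME_LEN 1522u
#define ETH_TPID_8021Q    0x8100u

/* Same fields as substrate_decap_stats_t; "pass" and "dropVLAN" are
 * updated later in the pipeline, not by this function. */
typedef struct {
    unsigned int rx, pass, dropLen, dropTPID, dropDst, dropVLAN;
} sd_eth_stats_sketch_t;

/* Returns 1 and stores the 12-bit VLAN ID if the frame passes the length,
 * TPID and destination-MAC checks; otherwise bumps one drop counter and
 * returns 0. mac_hi32/mac_lo16 play the role of the MAC address that SD
 * caches from IPV4_SD_MAC_ADDR_HI32 / IPV4_SD_MAC_ADDR_LO16. */
static int sd_validate_eth(const uint8_t *frame, size_t frame_len,
                           uint32_t mac_hi32, uint16_t mac_lo16,
                           sd_eth_stats_sketch_t *st, uint16_t *vlan_out)
{
    st->rx++;
    if (frame_len < ETH_MIN_FRAME_LEN || frame_len > ETH_MAX_FRAME_LEN) {
        st->dropLen++;
        return 0;
    }
    /* Bytes 12-13: tag protocol ID, must be 802.1Q. */
    uint16_t tpid = (uint16_t)((frame[12] << 8) | frame[13]);
    if (tpid != ETH_TPID_8021Q) {
        st->dropTPID++;
        return 0;
    }
    /* Bytes 0-5: destination MAC, compared as a 32b word plus a 16b word. */
    uint32_t dst_hi = ((uint32_t)frame[0] << 24) | ((uint32_t)frame[1] << 16) |
                      ((uint32_t)frame[2] << 8) | frame[3];
    uint16_t dst_lo = (uint16_t)((frame[4] << 8) | frame[5]);
    if (dst_hi != mac_hi32 || dst_lo != mac_lo16) {
        st->dropDst++;
        return 0;
    }
    /* Bytes 14-15: tag control info; the low 12 bits are the VLAN ID. */
    *vlan_out = (uint16_t)(((frame[14] << 8) | frame[15]) & 0x0FFF);
    return 1;
}
```

Splitting the MAC into a 32-bit and a 16-bit half keeps each comparison within one register width, which matches the HI32/LO16 pair listed under SD Initialization.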

  8. VLAN Table • code_option = 0 implies an invalid slice • code_option acts as the “on switch” for a slice in the data plane • SD slice data is currently only counters • 64B slice data • SRAM space for all 4096 VLANs
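A minimal model of the VLAN table follows. It assumes 8B entries holding code_option and slice_data_ptr. That size is consistent with the "SRAM: 2 4B reads" step in the SD block diagram, and would make the full 4096-entry table 32 KB.

The type and function names below are hypothetical. The real entry type is substrate_decap_vlan_table_entry_t, whose layout these slides do not show.

```c
#include <stdint.h>

#define SD_NUM_VLANS 4096u   /* one entry per 12-bit VLAN ID */

/* Hypothetical 8B entry (2 x 4B SRAM words). */
typedef struct {
    uint32_t code_option;     /* 0 = slice not enabled */
    uint32_t slice_data_ptr;  /* address of the slice's data block */
} vlan_entry_sketch_t;

/* Static storage starts zeroed, so every VLAN starts out invalid. */
static vlan_entry_sketch_t vlan_table[SD_NUM_VLANS];

/* Turns a slice "on" by giving its VLAN a nonzero code_option. */
static void enable_slice_sketch(uint16_t vlan, uint32_t code_option,
                                uint32_t slice_data_ptr)
{
    vlan_table[vlan & 0x0FFF].code_option = code_option;
    vlan_table[vlan & 0x0FFF].slice_data_ptr = slice_data_ptr;
}

/* Returns 1 and fills the outputs if the VLAN maps to an enabled slice;
 * returns 0 otherwise (the caller would count dropVLAN). */
static int lookup_vlan_sketch(uint16_t vlan, uint32_t *code_option,
                              uint32_t *slice_data_ptr)
{
    const vlan_entry_sketch_t *e = &vlan_table[vlan & 0x0FFF];
    if (e->code_option == 0)
        return 0;
    *code_option = e->code_option;
    *slice_data_ptr = e->slice_data_ptr;
    return 1;
}
```

Enabling VLAN 0xaaa here plays the same role as the default init_slice() call described under SD Initialization.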

  9. Substrate UDP/IP Validation • Header checks per RFC 1812: • IP version other than 4 • invalid header length • length too small • IP length doesn't match the Ethernet-deduced IP length • UDP length doesn't match the IP-deduced UDP length • NOTE: the Ethernet length must be checked so that padded 64B packets use the correct length
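The substrate UDP/IP checks can be sketched as follows. This sketch makes several assumptions:
- The frame length passed in includes the 4B CRC and has already passed the Ethernet length check.
- The tagged Ethernet header is 18B.
- "Length too small" means the datagram cannot hold both the IP and UDP headers.
- For minimum-size (padded) frames, the IP length only has to fit inside the Ethernet-deduced length. This is one reading of the NOTE above, not a confirmed rule.

The protocol check addresses the bug listed under SD Other. The enum and function names are hypothetical.

```c
#include <stdint.h>

#define ETH_VLAN_HDR_LEN  18u  /* 6+6+2+2+2 bytes, per the header diagram */
#define ETH_CRC_LEN        4u
#define ETH_MIN_FRAME_LEN 64u
#define IP_PROTO_UDP      17u

/* Result codes loosely named after substrate_decap_slice_stats_t fields. */
typedef enum { SD_PASS, SD_DROP_IP_VER, SD_DROP_HDR_LEN, SD_DROP_LEN_SMALL,
               SD_DROP_LEN_MISMATCH, SD_DROP_UDP_LEN, SD_DROP_PROTO } sd_result_t;

/* ip points at the substrate IP header; frame_len (>= 64, CRC included)
 * is the Ethernet frame length reported by Rx. */
static sd_result_t sd_validate_udp_ip(const uint8_t *ip, uint32_t frame_len)
{
    uint32_t ver = ip[0] >> 4;
    uint32_t hdr_len = (uint32_t)(ip[0] & 0x0F) * 4u;
    uint32_t ip_len = ((uint32_t)ip[2] << 8) | ip[3];
    uint32_t enet_ip_len = frame_len - ETH_VLAN_HDR_LEN - ETH_CRC_LEN;

    if (ver != 4)
        return SD_DROP_IP_VER;
    if (hdr_len < 20u)
        return SD_DROP_HDR_LEN;
    if (ip_len < hdr_len + 8u)          /* must hold IP + UDP headers */
        return SD_DROP_LEN_SMALL;
    /* Minimum-size frames may carry padding after the datagram, so only
     * require that it fit; larger frames must match exactly. */
    if (frame_len <= ETH_MIN_FRAME_LEN ? ip_len > enet_ip_len
                                       : ip_len != enet_ip_len)
        return SD_DROP_LEN_MISMATCH;
    if (ip[9] != IP_PROTO_UDP)          /* substrate protocol must be UDP */
        return SD_DROP_PROTO;
    const uint8_t *udp = ip + hdr_len;
    uint32_t udp_len = ((uint32_t)udp[4] << 8) | udp[5];
    if (udp_len != ip_len - hdr_len)    /* IP-deduced UDP length */
        return SD_DROP_UDP_LEN;
    return SD_PASS;
}
```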

  10. SD Block Diagram • [Figure: per-packet flow across dl_source(), substrate_decap(), memory accesses, and dl_sink()] • dl_source(): init signal; wait for prev ctx; NN dequeue; signal next ctx • substrate_decap(): read Eth/IP hdrs (DRAM: 5 8B reads); validate Ethernet; read VLAN table (SRAM: 2 4B reads); validate IP; read UDP hdr (DRAM: 2 8B reads); validate UDP; prepare ring data • dl_sink(): wait for prev ctx; NN enqueue; signal next ctx • add one 4B SRAM increment per counter (none currently for the common case)
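The "wait for prev ctx / signal next ctx" pattern in dl_source() and dl_sink() keeps packets in order while the work between them overlaps across threads. Below is a host-side analogy using POSIX threads: a mutex and condition variable stand in for the Microengine inter-thread signals. It is not the ordered_signal.[c,h] API.

Source and sink get separate tokens, so one context can enqueue while another is still dequeuing. All names are hypothetical.

```c
#include <pthread.h>

#define NUM_CTX 8        /* SD runs 8 threads */
#define PKTS_PER_CTX 4   /* packets each context handles in this demo */

/* Ordering token: a context enters its ordered section only when
 * turn == its id; leaving hands the token to the next context. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t cond;
    int turn;
} ordered_token_t;

static ordered_token_t src_tok = { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0 };
static ordered_token_t sink_tok = { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0 };

static int out_order[NUM_CTX * PKTS_PER_CTX];
static int out_count = 0;

static void wait_for_prev_ctx(ordered_token_t *t, int ctx)
{
    pthread_mutex_lock(&t->lock);
    while (t->turn != ctx)
        pthread_cond_wait(&t->cond, &t->lock);
    pthread_mutex_unlock(&t->lock);
}

static void signal_next_ctx(ordered_token_t *t)
{
    pthread_mutex_lock(&t->lock);
    t->turn = (t->turn + 1) % NUM_CTX;
    pthread_cond_broadcast(&t->cond);
    pthread_mutex_unlock(&t->lock);
}

static void *ctx_main(void *arg)
{
    int ctx = *(const int *)arg;
    for (int i = 0; i < PKTS_PER_CTX; i++) {
        int seq = i * NUM_CTX + ctx;        /* arrival order of this packet */

        wait_for_prev_ctx(&src_tok, ctx);   /* dl_source(): ordered dequeue */
        signal_next_ctx(&src_tok);

        /* per-packet work (substrate_decap()) would overlap here */

        wait_for_prev_ctx(&sink_tok, ctx);  /* dl_sink(): ordered enqueue */
        out_order[out_count++] = seq;
        signal_next_ctx(&sink_tok);
    }
    return NULL;
}

static int ctx_ids[NUM_CTX];

/* Starts all contexts and waits for them to finish. */
static void run_contexts(void)
{
    pthread_t th[NUM_CTX];
    for (int i = 0; i < NUM_CTX; i++) {
        ctx_ids[i] = i;
        pthread_create(&th[i], NULL, ctx_main, &ctx_ids[i]);
    }
    for (int i = 0; i < NUM_CTX; i++)
        pthread_join(th[i], NULL);
}
```

After run_contexts(), out_order holds 0, 1, 2, ... in arrival order, even though the contexts run concurrently.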

  11. File locations (in …/IPv4_MR/) • Code • src/substrate_decap/PL/substrate_decap.[c,h] • src/dispatch_loop/PL/substrate_decap_dl.[c,h] • src/dispatch_loop/PL/dl_source.[c,h] • dl_source() and dl_sink() functions • adds ordered thread synchronization if the following defined: • DL_ORDERED • FIRST_ORDERED_ME • LAST_ORDERED_ME • src/IXP2XXX_book/Chapter09/ordered_signal.[c,h] • functions for ordered thread synchronization • src/dispatch_loop/PL/nn_rings.[c,h] • functions for enqueuing and dequeuing NN ring data • Data formats • src/PL/ipv4_common.h • IP and UDP structure definitions • src/PL/substrate_common.h • Ethernet VLAN structure definitions • src/dispatch_loop/PL/ring_formats.h • ring data struct defs • build/PL/dispatch_loop/dl_system.h • memory locations

  12. Required Includes • Files • IXA_SDK_4.0\microengineC\src\intrinsic.c • IXA_SDK_4.0\microengineC\src\rtl.c • Directories • IXA_SDK_4.0\src\library\microblocks_library\microc\ • IXA_SDK_4.0\MicroengineC\include\..\..\..\..\ • IXA_SDK_4.0\src\library\dataplane_library\microc\ • These are required to gain access to the buffer libraries and intrinsic functions!

  13. SD Initialization • All memory locations are defined in dl_system.h, including: • locations for MAC address • IPV4_SD_MAC_ADDR_HI32 • IPV4_SD_MAC_ADDR_LO16 • non-VLAN-specific counters • IPV4_SD_COUNTERS_BASE • IPV4_SD_COUNTERS_SIZE • VLAN table • IPV4_SD_VLAN_CODE_OPT_TABLE_x (BASE, SIZE, ENTRY_SIZE) • VLAN-specific memory • SLICE_DATA_TABLE_x (BASE, SIZE, ENTRY_SIZE, ENTRY_TOTAL) • IPV4_SD_SLICE_DATA_ENTRY_OFFSET • At least one slice must be initialized to send packets • Call init_slice() from system_init.ind • Currently slice 0xaaa is initialized by default • All counters zeroed • SD caches the MAC address in registers • Thread 0 waits for a signal from Rx

  14. Substrate Decap Validation • All validation tests done with 1 thread and substrate_decap_tests.tcs • Ethernet validation/counter tests • invalid Ethernet packet length • non-VLAN tag protocol ID • non-locally-addressed packet • unrecognized VLAN • UDP/IP validation/counter tests • IP version other than 4 • invalid header length • length too small • IP length doesn't match Ethernet-deduced IP length • UDP length doesn't match IP-deduced UDP length • Watched counters for the proper number of increments • Fully valid packet: vlan_ip_udp_ip_udp/tcp (speed_test_all_valid.tcs) • Verified all fields of output ring data were as expected • Tested single-threaded and with 8 threads • Hardware testing • Uses Fred’s sp++ utility with a logged trace of the above packets • observed exactly the same behavior as in simulation

  15. SD Other • Bugs • substrate IP protocol field is not checked; it should be verified as UDP • Untested • buffer drops • Data Structures • substrate_decap_vlan_table_entry_t • substrate_decap_stats_t • substrate_decap_vlan_stats_t • vlan_ip_header • ipv4_header_struct • vlan_header_struct • udp_header • Performance • coming later

  16. IPv4 Parse

  17. IPv4 Parse • [Figure: IPv4 MR functional blocks: Rx, Substr Decap, Parse, Lookup, Header Format, QM, Tx] • figure taken from PlanetLab_Design.ppt • Main functions • Read/align IP header • Validate and consume IP header (per RFC 1812 5.2.2) • Update IP header • Decrement TTL • Recalculate IP checksum • Write updated checksum to DRAM • Read/align L4 (UDP/TCP/other) header • Mark exceptions for Header Format • Extract fields for Lookup
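The "Decrement TTL / Recalculate IP checksum" steps can be shown with a full recomputation, the straightforward baseline that the RFC 1624 optimization on slide 33 would replace. The sketch uses the standard one's-complement sum over 16-bit big-endian words.

The function names are hypothetical. The sketch assumes the caller has already flagged TTL 0 or 1 as an exception.

```c
#include <stddef.h>
#include <stdint.h>

/* One's-complement checksum over 16-bit big-endian words.
 * IPv4 header lengths are multiples of 4, so len is always even here. */
static uint16_t ip_checksum(const uint8_t *hdr, size_t len)
{
    uint32_t sum = 0;
    for (size_t i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)hdr[i] << 8) | hdr[i + 1];
    while (sum >> 16)                      /* fold carries back in */
        sum = (sum & 0xFFFF) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Decrements TTL (byte 8) and rewrites the checksum (bytes 10-11)
 * by summing the whole header again. */
static void ipv4_dec_ttl_full(uint8_t *hdr)
{
    size_t hdr_len = (size_t)(hdr[0] & 0x0F) * 4;
    hdr[8]--;                              /* TTL */
    hdr[10] = hdr[11] = 0;                 /* checksum field counts as 0 */
    uint16_t ck = ip_checksum(hdr, hdr_len);
    hdr[10] = (uint8_t)(ck >> 8);
    hdr[11] = (uint8_t)ck;
}
```

A header with a correct checksum sums to zero under ip_checksum(). That gives a cheap self-check after the update.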

  18. IPv4 MR Functional Blocks • [Figure: functional blocks Rx, DeMux (now Substr Decap), Parse, Lookup, Header Format, QM, Tx, annotated with Parse ring data formats] • Ring data into Parse: Buf Handle (32b) | MN Frm Length (16b), MN Frm Offset (16b) | Slice ID (VLAN) (16b), Rx UDP DPort (16b) | Rx IP SAddr (32b) | Rx UDP SPort (16b), Reserved (12b), Code (4b) | Slice Data Ptr (32b) • Ring data from Parse: Buf Handle (32b) | IP Pkt Length (16b), IP Pkt Offset (16b) | Lookup Key[143-112]: Slice ID/Rx UDP DPort (32b) | Lookup Key[111-80]: DA (32b) | Lookup Key[79-48]: SA (32b) | Lookup Key[47-16]: Ports (32b) | L Flags (4b), Exception Bits (12b), Lookup Key[15-0]: Proto/TCP_Flags (16b) | Slice Data Ptr (32b) | Reserved (28b), Code (4b) • IPv4 Exception Bits • Bit 0: TTL = 0 or 1 • Bit 1: Options
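A sketch of how Parse might fill the lookup-key words and exception bits in the output ring data. The exception bit positions come from the slide. The following are assumptions:
- the 16/16 splits of Slice ID/Rx UDP DPort, Ports (source port high) and Proto/TCP_Flags (protocol high);
- MSB-first packing of the L Flags | Exception Bits | Key[15-0] word;
- all type and function names, which are hypothetical.

```c
#include <stdint.h>

#define EXC_TTL     (1u << 0)  /* Bit 0: TTL = 0 or 1 */
#define EXC_OPTIONS (1u << 1)  /* Bit 1: IP options present */

/* Hypothetical register form of the five words carrying the 144-bit key. */
typedef struct {
    uint32_t key_143_112;   /* Slice ID / Rx UDP DPort */
    uint32_t key_111_80;    /* DA */
    uint32_t key_79_48;     /* SA */
    uint32_t key_47_16;     /* Ports */
    uint32_t flags_exc_key; /* L Flags(4b) | Exception Bits(12b) | Key[15-0] */
} lookup_key_words_t;

/* ihl_words is the IPv4 header length field (5 = no options). */
static uint32_t ipv4_exception_bits(uint8_t ttl, uint8_t ihl_words)
{
    uint32_t exc = 0;
    if (ttl <= 1)
        exc |= EXC_TTL;
    if (ihl_words > 5)
        exc |= EXC_OPTIONS;
    return exc;
}

static lookup_key_words_t pack_lookup_key(uint16_t slice_id, uint16_t rx_dport,
    uint32_t da, uint32_t sa, uint16_t sport, uint16_t dport,
    uint8_t proto, uint8_t tcp_flags, uint32_t l_flags, uint32_t exc_bits)
{
    lookup_key_words_t k;
    k.key_143_112 = ((uint32_t)slice_id << 16) | rx_dport;
    k.key_111_80 = da;
    k.key_79_48 = sa;
    k.key_47_16 = ((uint32_t)sport << 16) | dport;
    k.flags_exc_key = ((l_flags & 0xFu) << 28) | ((exc_bits & 0xFFFu) << 16) |
                      ((uint32_t)proto << 8) | tcp_flags;
    return k;
}
```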

  19. IPv4 Internal Header Formats • The first 4 bits discriminate between IPv4 and internal headers • for more details see planetlab_IPv4_MR_parse_hdr_format.ppt in bdh4\techx\IPv4_MR_shared • [Figure: internal header fields: Zeros (4b), Type (6b), Len (6b), Rx UDP DPort (2B), Tx UDP DPort (2B), Tx UDP SPort (2B), Type Dependent Data (8B), Tx IP DAddr (4B)]
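Because a valid IPv4 header starts with the version nibble 4 and an internal header starts with four zero bits, one shift of the first byte is enough to tell them apart.

The sketch below also decodes the first 32-bit word of an internal header. It assumes the fields are packed MSB-first as Zeros, Type, Len, Rx UDP DPort; the referenced .ppt may define it differently. The names are hypothetical.

```c
#include <stdint.h>

typedef enum { HDR_IPV4, HDR_INTERNAL, HDR_UNKNOWN } hdr_kind_t;

typedef struct {
    unsigned type;      /* 6-bit internal header type */
    unsigned len;       /* 6-bit length field */
    uint16_t rx_dport;  /* Rx UDP DPort */
} internal_hdr_word0_t;

/* Classifies a header by its first 4 bits. */
static hdr_kind_t classify_hdr(const uint8_t *p)
{
    switch (p[0] >> 4) {
    case 4:  return HDR_IPV4;      /* IPv4 version nibble */
    case 0:  return HDR_INTERNAL;  /* "Zeros (4b)" */
    default: return HDR_UNKNOWN;   /* invalid: count and drop */
    }
}

/* Bits 31-28 zeros, 27-22 Type, 21-16 Len, 15-0 Rx UDP DPort (assumed). */
static internal_hdr_word0_t decode_internal_word0(const uint8_t *p)
{
    uint32_t w = ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
                 ((uint32_t)p[2] << 8) | p[3];
    internal_hdr_word0_t h;
    h.type = (w >> 22) & 0x3F;
    h.len = (w >> 16) & 0x3F;
    h.rx_dport = (uint16_t)(w & 0xFFFF);
    return h;
}
```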

  20. Parse Validation • IPv4_parse_tests.tcs • Invalid internal header • invalid len for internal header type • internal header type unknown • Invalid IPv4 (RFC 1812 checks) • IP ver other than 4 • invalid header length • length too small • SD IP len doesn't match packet IP len • invalid header checksum • IPv4 Exceptions • options flag set in packet • TTL equals zero • TTL equals one • IPv4_parse_valid.tcs • Fully valid, no-exceptions packets • from GPE, classify • from GPE, non-classify • ingress, TCP • ingress, UDP

  21. Parse Block Diagram • [Figure: per-packet flow across dl_source(), ipv4_parse(), memory accesses, and dl_sink()] • dl_source(): init signal; wait for prev ctx; NN dequeue; signal next ctx • ipv4_parse(): read internal hdr (DRAM: 2 8B reads); handle internal hdr (DRAM: 2 8B reads); read IP hdr (DRAM: 4 8B reads); checksum; validate IP; read L4 hdr (DRAM: 4 8B reads); handle L4; prepare ring data • dl_sink(): wait for prev ctx; NN enqueue; signal next ctx • add one 4B SRAM increment per counter (none currently for the common case)

  22. File locations (in …/IPv4_MR/) • Code • src/ipv4/PL/ipv4_parse.[c,h] • src/dispatch_loop/PL/parse_dl.[c,h] • src/parse/PL/parse.[c,h] • src/dispatch_loop/PL/dl_source.[c,h] • dl_source() and dl_sink() functions • adds ordered thread synchronization if the following defined: • DL_ORDERED • FIRST_ORDERED_ME • LAST_ORDERED_ME • src/IXP2XXX_book/Chapter09/ordered_signal.[c,h] • functions for ordered thread synchronization • src/dispatch_loop/PL/nn_rings.[c,h] • functions for enqueuing and dequeuing NN ring data • Data formats • src/PL/ipv4_common.h • IP and UDP structure definitions • src/dispatch_loop/PL/ring_formats.h • ring data struct defs • build/PL/dispatch_loop/dl_system.h • memory locations

  23. Parse Initialization • All memory locations are defined in dl_system.h, including: • VLAN-specific memory • SLICE_DATA_TABLE_x (BASE, SIZE, ENTRY_SIZE, ENTRY_TOTAL) • IPV4_PARSE_SLICE_DATA_ENTRY_OFFSET • At least one slice must be initialized to send packets • Call init_slice() from system_init.ind • Currently slice 0xaaa is initialized by default • All counters zeroed

  24. Other • Bugs • none? • Untested • buffer drops • Unimplemented • checksum for IP options not handled yet • Data Structures • parse_vlan_stats_t • ipv4_header_struct • udp_header_struct • tcp_header_struct • Performance • coming next

  25. Performance

  26. Packet Sizes

  27. Cycle Budget (min Ethernet packets) • To hit 5 Gb/s: • 76B per min IPv4 packet (64B min Ethernet frame + 12B IFS) • 1.4 GHz clock rate • 5 Gb/sec * 1B/8b * packet/76B = 8.22 Mp/sec • 1.4 Gcycle/sec * 1 sec / 8.22 Mp = 170.3 cycles per packet • compute budget: 170 cycles • latency budget: threads * 170 • 4 threads: 680 cycles • 8 threads: 1360 cycles

  28. Cycle Budget (IPv4 MN packets) • To hit 5 Gb/s: • 90B per min IPv4 MN packet (78B min IPv4 MN packet + 12B IFS) • 1.4 GHz clock rate • 5 Gb/sec * 1B/8b * packet/90B = 6.94 Mp/sec • 1.4 Gcycle/sec * 1 sec / 6.94 Mp = 201.7 cycles per packet • compute budget: 201 cycles • latency budget: threads * 201 • 4 threads: 804 cycles • 8 threads: 1608 cycles
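The budget arithmetic on the two slides above reduces to a few lines. The formulas are:
- cycles per packet = clock × wire bytes × 8 / line rate;
- latency budget = threads × compute budget;
- expected throughput = (budget / measured cycles) × target rate, as used on the performance slides.

The function names are hypothetical.

```c
/* Packets per second at a given line rate; wire_bytes includes the 12B IFS. */
static double packets_per_sec(double gbps, double wire_bytes)
{
    return gbps * 1e9 / 8.0 / wire_bytes;
}

/* Compute budget: clock cycles available per packet. */
static double cycles_per_packet(double clock_ghz, double gbps, double wire_bytes)
{
    return clock_ghz * 1e9 / packets_per_sec(gbps, wire_bytes);
}

/* With N threads interleaved, one packet may take up to N compute budgets
 * from start to finish. */
static double latency_budget(double compute_cycles, int threads)
{
    return compute_cycles * threads;
}

/* Throughput estimate when a block needs actual_cycles per packet. */
static double expected_gbps(double budget_cycles, double actual_cycles,
                            double target_gbps)
{
    return budget_cycles / actual_cycles * target_gbps;
}
```

Without intermediate rounding this gives about 170.2 and 201.6 cycles. The slides' 170.3 and 201.7 come from rounding the packet rate to 8.22 and 6.94 Mp/sec first.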

  29. Performance Anomalies • Spot the issue! • these issues have since been fixed! • [Figure: Substrate Decap, annotated “unhidden DRAM latency” and “more DRAM contention”]

  30. Substrate Decap Performance • Optimized common case (ingress, no options) • Combined initial header checks • No options assumed → single DRAM read • 153 cycles typical • ~650 cycles latency • 337 control store instructions • Expected performance • (201/153)*5Gb = ~6.5Gb • Simulated performance (as of 9/26/2006) • >5 Gb, but something else slows down a 6Gb input

  31. SD Optimizations • possible optimizations • caching VLAN-to-CodeOption table in Local Memory • optimize nn_dequeue_incr() via assembly coding • move VLAN counter computation off fast path? • use transfer regs directly • saves 9 cycles • remove volatile statements

  32. Parse Performance • single-threaded • ~380 cycles for computation • 1708 cycles latency • 556 control store instructions • Expected performance • (201/380)*5Gb = ~2.6Gb, i.e. <3Gb • Going to optimize a bit before adding all 8 threads

  33. Parse Optimizations • possible optimizations • incremental IPv4 checksum update per RFC1624 • checksum computation in assembler • optimized 5LW alignment for IP read • combined initial error-check to optimize common case • reduces branch delays • slows down exception path
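The incremental checksum update from RFC 1624 (Eqn. 3) is HC' = ~(~HC + ~m + m'). Here m and m' are the old and new values of the 16-bit word that changed. For a TTL decrement that word is TTL/Protocol (header bytes 8-9), so the update needs no pass over the rest of the header.

A sketch with hypothetical names:

```c
#include <stdint.h>

/* RFC 1624 Eqn. 3 in one's-complement arithmetic.
 * hc: old checksum; old_word/new_word: the 16-bit word before/after. */
static uint16_t csum_incremental(uint16_t hc, uint16_t old_word, uint16_t new_word)
{
    uint32_t sum = (uint32_t)(uint16_t)~hc + (uint16_t)~old_word + new_word;
    while (sum >> 16)                       /* end-around carry */
        sum = (sum & 0xFFFF) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Decrements TTL and patches the checksum using only bytes 8-11. */
static void ipv4_dec_ttl_incremental(uint8_t *hdr)
{
    uint16_t old_word = (uint16_t)((hdr[8] << 8) | hdr[9]);   /* TTL|Proto */
    uint16_t hc = (uint16_t)((hdr[10] << 8) | hdr[11]);
    hdr[8]--;
    uint16_t new_word = (uint16_t)((hdr[8] << 8) | hdr[9]);
    uint16_t hc2 = csum_incremental(hc, old_word, new_word);
    hdr[10] = (uint8_t)(hc2 >> 8);
    hdr[11] = (uint8_t)hc2;
}
```

Eqn. 3 is used rather than the older form HC' = HC + m + ~m' (RFC 1141) because the older form can produce 0xFFFF where a full recomputation would give 0x0000.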

  34. Implementation Status • Parse needs • error testing • IP options with checksum • multithreading • drop tests

  35. Image Slide Template

  36. Text Slide Template

  37. Extra Slides

  38. Parse Memory Usage • Memory reads/writes • 2 8B DRAM reads: unaligned internal header • 2 8B DRAM reads: unaligned internal header + FwdKey • 4 8B DRAM reads: unaligned IPv4 header • [0,6] DRAM reads: unaligned IPv4 header options • 4 8B DRAM reads: unaligned L4 header • 1 SRAM increment: per counter • 1 DRAM write: updated TTL and checksum
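The read counts above follow from fetching in 8B units that must start on 8B boundaries. A span of len bytes at byte offset off touches every 8B unit from off/8 through (off+len-1)/8.

The sketch below (hypothetical name) reproduces several slide figures:
- 4 reads for a worst-case unaligned 20B IPv4 header;
- up to 6 reads for 40B of options;
- assuming the frame starts 8B-aligned, SD's 5 reads for the 38B of Ethernet/VLAN and IP headers, and 2 reads for the UDP header that follows.

```c
/* Number of aligned 8B DRAM reads needed to cover len bytes at offset off. */
static unsigned dram_reads_8b(unsigned off, unsigned len)
{
    if (len == 0)
        return 0;
    unsigned first = off / 8;              /* first 8B unit touched */
    unsigned last = (off + len - 1) / 8;   /* last 8B unit touched */
    return last - first + 1;
}
```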

  39. Ethernet Validation • First, read the packet from memory (guaranteed aligned) • Counters are not specific to any VLAN, so they live in a separate memory area • For efficiency, counters can be kept in LM and written to RAM when a signal is triggered

      typedef struct _substrate_decap_stats_t {
          unsigned int rx;        // received
          unsigned int pass;      // passed to next stage
          unsigned int dropLen;   // invalid Ethernet packet length
          unsigned int dropTPID;  // non-VLAN tag protocol ID
          unsigned int dropDst;   // non-locally-addressed packet
          unsigned int dropVLAN;  // unrecognized VLAN
      } substrate_decap_stats_t;
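The idea of keeping counters in Local Memory and writing them to RAM on a signal can be sketched as per-thread pending counts that are flushed in one batch. This avoids one SRAM increment per counter on every packet.

In this sketch a plain array stands in for the shared counter block at IPV4_SD_COUNTERS_BASE. On the real hardware, with several threads sharing it, the flush would have to be an atomic SRAM add. All names are hypothetical.

```c
#include <stdint.h>

#define NUM_COUNTERS 6   /* rx, pass, dropLen, dropTPID, dropDst, dropVLAN */

/* Stand-in for the shared SRAM counter block. */
static uint32_t shared_counters[NUM_COUNTERS];

/* Per-thread pending counts, playing the role of Local Memory. */
typedef struct { uint32_t pending[NUM_COUNTERS]; } local_counters_t;

/* Fast path: no memory transaction, just a local increment. */
static void local_incr(local_counters_t *lc, int idx)
{
    lc->pending[idx]++;
}

/* Called when the flush signal fires; returns how many shared-memory
 * updates were issued (one per nonzero counter). */
static unsigned flush_counters(local_counters_t *lc)
{
    unsigned writes = 0;
    for (int i = 0; i < NUM_COUNTERS; i++) {
        if (lc->pending[i] != 0) {
            shared_counters[i] += lc->pending[i];
            lc->pending[i] = 0;
            writes++;
        }
    }
    return writes;
}
```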

  40. UDP/IP Validation

      typedef struct _substrate_decap_slice_stats_t {
          unsigned int dropIPVer;        // IP ver other than 4
          unsigned int dropHdrLen;       // invalid header length
          unsigned int dropLenSmall;     // length too small
          unsigned int dropLenMismatch;  // IP len doesn't match Enet IP len
          unsigned int dropUDPLen;       // UDP len doesn't match IP UDP len
          unsigned int pass;             // passed to next stage
      } substrate_decap_slice_stats_t;

  41. RFC 1812 5.2.2 IP Header Validation • (1) The packet length reported by the Link Layer must be large enough to hold the minimum length legal IP datagram (20 bytes). • (2) The IP checksum must be correct. • (3) The IP version number must be 4. If the version number is not 4 then the packet may be another version of IP, such as IPng or ST-II. • (4) The IP header length field must be large enough to hold the minimum length legal IP datagram (20 bytes = 5 words). • (5) The IP total length field must be large enough to hold the IP datagram header, whose length is specified in the IP header length field. • from http://www.faqs.org/rfcs/rfc1812.html
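The five RFC 1812 5.2.2 checks map directly onto a small C function. In this sketch the checksum test (2) runs last because it needs a header length that checks (3) and (4) have already vetted. A valid header's one's-complement sum, including the checksum field, is 0xFFFF.

The enum and function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

typedef enum { RFC1812_OK, RFC1812_SHORT, RFC1812_BAD_CSUM, RFC1812_BAD_VER,
               RFC1812_BAD_IHL, RFC1812_BAD_TOTLEN } rfc1812_result_t;

/* Folded one's-complement sum of 16-bit big-endian words (len is even). */
static uint16_t ones_sum16(const uint8_t *p, size_t len)
{
    uint32_t sum = 0;
    for (size_t i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)p[i] << 8) | p[i + 1];
    while (sum >> 16)
        sum = (sum & 0xFFFF) + (sum >> 16);
    return (uint16_t)sum;
}

/* link_len: datagram length reported by the link layer. */
static rfc1812_result_t rfc1812_validate(const uint8_t *ip, size_t link_len)
{
    if (link_len < 20)                          /* (1) minimum datagram */
        return RFC1812_SHORT;
    size_t ihl_bytes = (size_t)(ip[0] & 0x0F) * 4;
    size_t total_len = ((size_t)ip[2] << 8) | ip[3];
    if ((ip[0] >> 4) != 4)                      /* (3) version */
        return RFC1812_BAD_VER;
    if (ihl_bytes < 20)                         /* (4) header length */
        return RFC1812_BAD_IHL;
    if (total_len < ihl_bytes)                  /* (5) total length */
        return RFC1812_BAD_TOTLEN;
    if (ihl_bytes > link_len)                   /* header must be present */
        return RFC1812_SHORT;
    if (ones_sum16(ip, ihl_bytes) != 0xFFFF)    /* (2) checksum */
        return RFC1812_BAD_CSUM;
    return RFC1812_OK;
}
```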
