Cache Automaton

Cache Automaton. Arun Subramaniyan, Jingcheng Wang, Ezhil R. M. Balasubramanian, David Blaauw, Dennis Sylvester, Reetuparna Das. University of Michigan – Ann Arbor. MICRO-50, Boston, USA – October 17th, 2017. Pattern matching in abundance …

Presentation Transcript


  1. Cache Automaton. Arun Subramaniyan, Jingcheng Wang, Ezhil R. M. Balasubramanian, David Blaauw, Dennis Sylvester, Reetuparna Das. University of Michigan – Ann Arbor. MICRO-50, Boston, USA – October 17th, 2017

  2. Pattern matching in abundance …
     Log Processing: 127.0.0.1 - - [07/Dec/2016:11:04:58 +0100] "GET / HTTP/1.1" 304 0 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:49.0) Gecko/20100101 Firefox/49.0"
     XML Parsing: <book id="bk101"><author>Gambardella, Matthew</author><title>XML Developer's Guide</title><genre>Computer</genre><price>44.95</price><publish_date>2000-10-01</publish_date><description>An in-depth look at creating applications with XML.</description></book>
     Natural Language Processing: / right/JJ to/[^\s]+ / / the/[^\s]+ back/RB / / [^/]+/DT longer/[^\s]+ / / ,/[^\s]+ have/VB / / [^/]+/VBD by/[^\s]+ /
     Network Intrusion Detection: # alert tcp $EXTERNAL_NET any -> $HOME_NET $HTTP_PORTS (msg:"PROTOCOL-SCADA Cogent DataHub server-side information disclosure"; flow:to_server,established; content:".asp."; nocase; http_uri; pcre:"/\x2easp\x2e($|\?)/iU"; metadata:service http; reference:cve,2011-3502; classtype:web-application-attack; sid:20174; rev:4;)
     Data Analytics: Donald Duck, Donald Fauntleroy Duck, Donald F Duck, D F Duck
     Motif search: /([STX])(.{2}?)([DBEZX])/ /([RKX])(.{2,3}?)([DBEZX])(.{2,3}?)([YX])/ /([GX])([^EDRKHPFYW])(.{2}?)([STAGCNBX])([^P])/
     Also listed: Video Decoding, Particle Path Tracking

  3. Finite State Automata extract patterns. [Figure: three-state automaton (S0, S1, S2) with transitions labeled x, \s, and \n.] Σ = {x, \s, \n}

  4. Compute-Centric Architectures: Switch-Case
       while (c != EOF) {
         if (c == '\n') { putchar('\n'); *state = S0; }
         else
           switch (*state) {
             case S0:
               if (c != '\s') { putchar(c); *state = S1; }
               break;
             case S1:
               if (c == '\s') { *state = S2; }
               else { putchar(c); }
               break;
             case S2: break;
           }
       }
     Drawbacks: irregular memory accesses, branch mispredictions.

  5. Compute-Centric Architectures: Table-Lookup
       struct T_table[3][3] = {
         { {S0, S1}, {S0, S0}, {S0, S0} },
         { {S1, S1}, {S1, S2}, {S1, S0} },
         { {S2, S2}, {S2, S2}, {S2, S0} } };

       while (c != EOF) {
         /* columns are ordered x, \s, \n; each entry is {current, next} */
         int id = (c == '\s') ? 1 : (c == '\n') ? 2 : 0;
         *state = T_table[*state][id][1];
         putchar(c);
       }
     Drawbacks: limited state transitions per cycle, memory-bandwidth bottlenecked.
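The table-lookup idea on slides 4 and 5 can be sketched as a self-contained C function. This is only an illustrative sketch, not the authors' code: the `entry` struct, the `emit` flag, `classify`, and `first_words` are hypothetical names, and `\s` is assumed to mean a single space character. The transitions follow the switch-case version on slide 4, which copies the first word of every line.

```c
#include <assert.h>
#include <stddef.h>

/* States of the slide-3 automaton and the three symbol classes of its alphabet. */
enum { S0, S1, S2, NUM_STATES };
enum { SYM_SPACE, SYM_NEWLINE, SYM_OTHER, NUM_CLASSES };

/* One table entry: next state plus a flag saying whether to copy the symbol out.
   The emit flag is an assumption added so the table reproduces slide 4's output. */
struct entry { unsigned char next; unsigned char emit; };

static const struct entry T[NUM_STATES][NUM_CLASSES] = {
    /*          space     newline   other   */
    /* S0 */ { {S0, 0}, {S0, 1}, {S1, 1} },
    /* S1 */ { {S2, 0}, {S0, 1}, {S1, 1} },
    /* S2 */ { {S2, 0}, {S0, 1}, {S2, 0} },
};

static int classify(char c) {
    return c == ' ' ? SYM_SPACE : c == '\n' ? SYM_NEWLINE : SYM_OTHER;
}

/* Copies the first word of each line of `in` (plus the newline) into `out`.
   Returns the number of characters written, excluding the terminating NUL. */
size_t first_words(const char *in, char *out, size_t out_cap) {
    assert(in != NULL && out != NULL && out_cap > 0);
    int state = S0;
    size_t n = 0;
    for (; *in != '\0'; ++in) {
        const struct entry e = T[state][classify(*in)];  /* one lookup per symbol */
        if (e.emit && n + 1 < out_cap) out[n++] = *in;
        state = e.next;
    }
    out[n] = '\0';
    return n;
}
```

Each input symbol costs one dependent table load, which is the "limited state transitions per cycle" bottleneck the slide points at.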

  6. Memory-Centric Architectures: Micron's Automata Processor (AP), 1 rank with 8 chips. 48k state transitions per cycle per chip; 256x faster than CPU, 170x faster than GPU (iNFAnt2) [ANMLZoo, Wadden et al. '16]

  7. In-memory automata processing. [Figure: an NFA with states S1–S6, start and report states, and transitions on a, b, and i, shown next to an equivalent automaton.]

  8. In-memory automata processing. [Figure: the same NFA and its equivalent.] Each state has incoming transitions on one input symbol.

  9. In-memory automata processing. [Figure: the same NFA and its equivalent.] Homogeneous NFA.

  10. State representation: one-hot encoding over the 8-bit ASCII alphabet. [Figure: 256-row bit columns (rows 0–255) for states labeled a, b, and *; rows 97 and 98 are marked.]
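One way to picture the one-hot state representation is as a 256-bit column per state, with bit s set when the state accepts input symbol s. The sketch below is an assumption-laden illustration, not the hardware layout: `ste_column` and its helper functions are hypothetical names (STE is the term used on slide 47), and the column is packed into four 64-bit words purely for convenience.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ALPHABET_SIZE 256   /* 8-bit ASCII alphabet, rows 0..255 */

/* One state column: bit s is 1 when the state accepts symbol s. */
typedef struct { uint64_t bits[ALPHABET_SIZE / 64]; } ste_column;

void ste_clear(ste_column *col) {
    assert(col != NULL);
    memset(col, 0, sizeof *col);
}

/* Mark a single symbol as accepted, e.g. row 97 for 'a'. */
void ste_accept(ste_column *col, unsigned char sym) {
    assert(col != NULL);
    col->bits[sym >> 6] |= (uint64_t)1 << (sym & 63);
}

/* Wildcard state ('*' on the slide): every row is 1. */
void ste_accept_any(ste_column *col) {
    assert(col != NULL);
    memset(col->bits, 0xFF, sizeof col->bits);
}

/* Returns 1 if the state accepts `sym`, else 0. */
int ste_matches(const ste_column *col, unsigned char sym) {
    assert(col != NULL);
    return (int)((col->bits[sym >> 6] >> (sym & 63)) & 1u);
}
```

A state labeled `a` would have only row 97 set, while the `*` state has all 256 rows set.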

  11. In-memory automata processing. [Figure: homogeneous NFA with start and report states, driven by a stream of input symbols.]

  12. In-memory automata processing. Input symbols: aba… Step 1: State-Match. [Figure: the same homogeneous NFA.]

  13. Massively Parallel State-Match. Repurpose memory columns as FSM states (S0, S1, S2, …, Sn); repurpose the row address (0–255) as the input symbol. [Figure: memory array driven by the input symbols, with an Active State Vector.]

  14. Massively Parallel State-Match. [Figure: the same array, filled with the per-state match bits.]

  15. Massively Parallel State-Match. [Figure: same as slide 14.]

  16. Massively Parallel State-Match. Bit-parallel state-match. [Figure: the row selected by the input symbol is read out as a bit vector across the state columns, next to the Active State Vector.]

  17. In-memory automata processing. Input symbols: aba… Step 1: State-Match. Step 2: State-Transition. [Figure: the same homogeneous NFA.]

  18. Massively Parallel State-Transition. [Figure: the state-match array from slide 16; the bit-parallel state-match result is combined (&) with the Active State Vector.]

  19. Massively Parallel State-Transition. Bit-parallel state-transition through a custom interconnect. [Figure: the state-match array feeding the custom interconnect and the Active State Vector.]

  20. Massively Parallel State-Transition. [Figure: the custom interconnect and the resulting Active State Vector.]
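Putting state-match and state-transition together, one input symbol could be processed as in the sketch below. It is a software model under stated assumptions, not the hardware: the NFA is limited to 64 states, the interconnect is modeled as one successor bit mask per state, and `nfa` and `nfa_step` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_STATES 64   /* assumption: active state vector fits in one 64-bit word */

typedef struct {
    uint64_t match[256];        /* match[s] bit j: state j accepts symbol s        */
    uint64_t succ[MAX_STATES];  /* succ[j] bit k: interconnect links state j to k  */
} nfa;

/* One step for one input symbol.
   1. State-match: read the row for `sym` and AND it with the active state vector.
   2. State-transition: OR together the successor masks of every state that fired.
   Returns the next active state vector; the fired states go to *fired_out if non-NULL. */
uint64_t nfa_step(const nfa *a, uint64_t active, unsigned char sym, uint64_t *fired_out) {
    assert(a != NULL);
    uint64_t fired = a->match[sym] & active;
    uint64_t next = 0;
    for (unsigned j = 0; j < MAX_STATES; ++j)
        if ((fired >> j) & 1u)
            next |= a->succ[j];
    if (fired_out != NULL)
        *fired_out = fired;
    return next;
}
```

In this model the loop over fired states stands in for the interconnect, which in hardware would propagate all transitions at once rather than one state at a time.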

  21. Cache Automaton? Limitations of the AP: DRAM technology is energy inefficient and the AP is slow (~133 MHz); low packing density (12 MB AP = 200 MB regular DRAM); off-chip host-to-accelerator communication overhead. Are SRAM-based caches suitable for automata processing?

  22. Outline • Motivation • In-Memory Automata Processing • Opportunity • Challenges • Cache Automaton • Compiler • Evaluation

  23. Opportunity ✓ SRAM-based caches are faster and more energy efficient than DRAM ✓ Integrated on processor dies with performance-optimized logic ! Is cache capacity an issue for large NFA?

  24. Opportunity

  25. Outline • Motivation • In-Memory Automata Processing • Opportunity • Challenges • Cache Automaton • Compiler • Evaluation

  26. Challenges (1): Using the cache hierarchy is slow

  27. Using the cache hierarchy is slow: L1 cache ~2 cycles, L2 cache ~12 cycles, L3 cache ~36 cycles (credits: Intel). Low operating frequency: ~111 MHz.

  28. Challenges (2): Efficient state-match and state-transition in cache

  29. Column multiplexing reduces bit-parallelism. Repurpose memory columns to store FSM states; repurpose the row address (0–255) as the input symbol. [Figure: row decoder and column-multiplexed array; bit-level parallelism of state-match.]

  30. Scalable interconnect for state-transition in cache. Crossbar? Infeasible: 100,000 states need a 100,000 x 100,000 crossbar. Network architecture? Overprovisioned, with dynamic arbitration and multi-bit ports.

  31. Outline • Motivation • In-Memory Automata Processing • Opportunity • Challenges • Cache Automaton • Compiler • Evaluation

  32. (1) In-situ computation cognizant of cache geometry can provide benefits

  33. Intel Xeon Last-Level Cache. [Figure: 2.5 MB slice with CBOX, Way 1 to Way 20, 32kB data banks built from two 16kB subarrays, and Tag, State, LRU, LS.]

  34. Intel Xeon Last-Level Cache. [Figure: the same slice, zoomed into a 16kB subarray: Chunk 0 to Chunk 63, SRAM arrays with 4:1 and 2:1 multiplexers, decoder, and I/O.]

  35. Intel Xeon Last-Level Cache. [Figure: the same view, zoomed into an 8kB SRAM array: row decoder, word lines (WL) 0 to 255, bit lines BL and /BLB.]

  36. Intel Xeon Last-Level Cache. [Figure: same as slide 35, annotated "@ 4 GHz".]

  37. (2) Sense-amplifier cycling enables highly parallel, low-latency state-match

  38. Accelerating state-match is challenging … With column multiplexing, the read sequence takes 4 cycles to read 4 adjacent bits (state-match results).

  39. Sense-amplifier cycling. Read sequence: 4 cycles to read 4 adjacent bits (state-match results). Read sequence (optimized): ✓ < 2 cycles to read 4 adjacent bits (state-match results).

  40. (3) 8T-based SRAM arrays can be repurposed to compactly encode state-transitions

  41. Accelerating state-transition. Compact 8T SRAM-based switches as the building block of a programmable interconnect. [Figure: crosspoint with enable bit and 2T read stack.]

  42. (4) Real-world NFA can be grouped into dense regions with sparse connectivity

  43. Hierarchical network architecture. Real-world NFA: dense regions, sparse connectivity between them. Hierarchical switch topology (L-switches joined by a G-switch) scales to large NFA.
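A rough count shows why a hierarchy of small switches is attractive compared with the flat crossbar from slide 30. The sketch below is back-of-the-envelope arithmetic only: it assumes 256-state partitions, each served by a 280x256 L-switch (the dimensions shown on slide 47), and it deliberately leaves out the G-switches because the slides do not give their sizes, so the hierarchical total is understated. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Dimensions taken from slide 47: 256-state partition with a 280x256 L-switch. */
#define PARTITION_STATES 256u
#define LSWITCH_INPUTS   280u
#define LSWITCH_OUTPUTS  256u

/* Crosspoints in a single flat crossbar connecting every state to every state. */
uint64_t flat_crossbar_crosspoints(uint64_t states) {
    assert(states <= UINT32_MAX);   /* keeps states * states inside 64 bits */
    return states * states;
}

/* Crosspoints in the local switches alone; G-switch cost is NOT included. */
uint64_t local_switch_crosspoints(uint64_t states) {
    assert(states <= UINT32_MAX);
    uint64_t partitions = (states + PARTITION_STATES - 1) / PARTITION_STATES;  /* round up */
    return partitions * LSWITCH_INPUTS * LSWITCH_OUTPUTS;
}
```

For 100,000 states this gives 10,000,000,000 crosspoints for the flat crossbar against 28,026,880 for 391 local switches, before adding whatever the global switches cost.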

  44. Putting it together … [Figure: cache slice (CBOX, Way 1 to Way 20, 16kB subarrays, Tag, State, LRU, LS) with an L-switch (local switch); Array_HPA[16] = 1, Array_LPA[16] = 0.]

  45. Cache Automaton Architecture. [Figure: the same slice with a G-switch-1 added above the L-switch.]

  46. Cache Automaton Architecture. [Figure: the same slice with G-switch-4, G-switch-1, L-switch, input buffer, output buffer, and H-bus.]

  47. Cache Automaton Architecture. [Figure: 256-state partition (2 x 4KB SRAM arrays) holding STE(0) to STE(255); the 8-bit input drives the row decoder (rows 0 to 255) to produce a 256 b Match Vector, combined with the 256 b Active State Vector; a 280x256 L-Switch with 16b to and from G-Switch-1 and 8b to and from G-Switch-4; Report Vector and output bit; input and output buffers at the CBOX.]

  48. Merging connected components. [Figure: several small pattern automata over the letters a, c, e, l, m, r, t, each beginning at a * start state, before and after merging.]

  49. Cache Automaton designs. Performance-optimized (CA_P): ✓ small connected components -> low-radix switches; drawback: redundant state activity. Space-optimized (CA_S): ✓ low footprint, no redundant state activity; drawback: large connected components -> high-radix switches and careful mapping.

  50. Outline • Motivation • In-Memory Automata Processing • Opportunity • Challenges • Cache Automaton • Compiler • Evaluation
