August 8th, 2011 Kevan Thompson

Creating a Scalable Coherent L2 Cache August 8th, 2011 Kevan Thompson

Outline • Motivation • Cache Background • System Overview • Methodology • Progress • Future Work

Motivation Goal • Create a configurable shared Last Level Cache for use in the PolyBlaze system

Introduction Kevan Zia Eric

Cache Background • In modern systems, processors outperform main memory, creating a bottleneck • This problem is exacerbated as more cores contend for the memory • The problem is reduced if each processor maintains a local copy of the data

Caches • A cache is a small amount of memory on the same die as the processor • The cache provides lower latency and higher throughput than main memory • Systems may include multiple cache levels • The smallest and most local cache is the L1 cache; the next level is the L2, and so on

Shared Last Level Cache • Acts as a common location for data • Can be used to maintain cache coherency between processors • Does not exist in the current MicroBlaze system • We will design our own shared L2 Cache to maintain cache coherency

Cache Speeds • In typical systems: • An L1 cache is very fast (1 or 2 cycles) • An L2 cache is slower (10’s of cycles) • Main memory is very slow (100’s of cycles)
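
One common way to see why these latencies matter is the average memory access time (AMAT). The sketch below is illustrative only: the cycle counts are picked from the ranges on this slide, and the miss rates are assumptions, not measurements.

```python
def amat(l1_hit_cycles, l1_miss_rate, l2_hit_cycles, l2_miss_rate, mem_cycles):
    """Average memory access time, in cycles, for a two-level cache hierarchy."""
    # Cost of an L1 miss: the L2 access, plus main memory when the L2 also misses.
    l1_miss_penalty = l2_hit_cycles + l2_miss_rate * mem_cycles
    return l1_hit_cycles + l1_miss_rate * l1_miss_penalty

# Assumed values: 1-cycle L1, 10-cycle L2, 100-cycle memory,
# 5% L1 miss rate, 20% L2 miss rate.
typical = amat(1, 0.05, 10, 0.20, 100)
print(f"AMAT = {typical:.2f} cycles")
```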

Cache Speeds • In our system we expect: • The L1 cache to be very fast (1 or 2 cycles) • The L2 cache to take about 10 cycles • Main memory to be faster than in a typical system (10’s of cycles) • To model the memory bottleneck of a much faster system, we’ll need to stall the Main Memory

Direct Mapped Cache • Caches store Data, a Valid Bit and a unique identifier called a tag
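
A minimal sketch of a direct-mapped lookup using the valid bit and tag described above. The class name, method name, and the tiny 4-line geometry are hypothetical, chosen only to keep the example short; data storage is omitted.

```python
class DirectMappedCache:
    """Toy direct-mapped cache that tracks only valid bits and tags."""

    def __init__(self, num_lines=4, line_bytes=32):
        self.num_lines = num_lines
        self.line_bytes = line_bytes
        self.valid = [False] * num_lines
        self.tags = [0] * num_lines

    def access(self, addr):
        """Return True on a hit; on a miss, install the line and return False."""
        line_addr = addr // self.line_bytes
        index = line_addr % self.num_lines   # exactly one candidate line
        tag = line_addr // self.num_lines    # identifies which address lives there
        if self.valid[index] and self.tags[index] == tag:
            return True
        self.valid[index] = True
        self.tags[index] = tag
        return False

cache = DirectMappedCache()
# 0x00 and 0x80 map to the same line, so they evict each other.
results = [cache.access(a) for a in (0x00, 0x04, 0x80, 0x00)]
print(results)
```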

Tags • As an example, imagine a system with the following: • 32-bit Address Bus, and 32-bit Word Size • 64-KByte Cache with 32-Byte Line Size • Therefore we have 2048 (2^11) Lines
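
The arithmetic for this example can be checked with a short sketch. It assumes the cache is direct mapped, so every line has its own index; the function name `split_address` is hypothetical.

```python
ADDRESS_BITS = 32
CACHE_BYTES = 64 * 1024
LINE_BYTES = 32

num_lines = CACHE_BYTES // LINE_BYTES                # 2048 lines
offset_bits = LINE_BYTES.bit_length() - 1            # 5 bits select a byte in the line
index_bits = num_lines.bit_length() - 1              # 11 bits select the line
tag_bits = ADDRESS_BITS - index_bits - offset_bits   # remaining 16 bits form the tag

def split_address(addr):
    """Split a 32-bit byte address into (tag, index, offset)."""
    offset = addr & (LINE_BYTES - 1)
    index = (addr >> offset_bits) & (num_lines - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset

print(num_lines, offset_bits, index_bits, tag_bits)
print(split_address(0x12345678))
```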

Set-Associative Cache • A cache with n possible entries for each address is called an n-way set-associative cache [Figure: 4-way set-associative cache]
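
A minimal sketch of an n-way lookup, assuming a toy geometry of 2 sets and 4 Ways. The class is hypothetical, and the round-robin victim choice is only a placeholder for a real replacement policy.

```python
class SetAssociativeCache:
    """Toy n-way set-associative cache that tracks tags only."""

    def __init__(self, num_sets=2, ways=4, line_bytes=32):
        self.num_sets = num_sets
        self.ways = ways
        self.line_bytes = line_bytes
        self.sets = [[None] * ways for _ in range(num_sets)]  # None marks an invalid Way
        self.next_victim = [0] * num_sets                     # placeholder policy state

    def access(self, addr):
        """Return True on a hit; on a miss, install the line and return False."""
        line_addr = addr // self.line_bytes
        index = line_addr % self.num_sets
        tag = line_addr // self.num_sets
        entries = self.sets[index]
        if tag in entries:              # hardware compares all Ways in parallel
            return True
        if None in entries:             # fill an invalid Way first
            way = entries.index(None)
        else:                           # otherwise evict round-robin (placeholder)
            way = self.next_victim[index]
            self.next_victim[index] = (way + 1) % self.ways
        entries[way] = tag
        return False

cache = SetAssociativeCache()
addrs = (0, 64, 128, 192)               # four different tags, all in set 0
first_pass = [cache.access(a) for a in addrs]
second_pass = [cache.access(a) for a in addrs]
print(first_pass, second_pass)
```

With four Ways, the four conflicting lines coexist, so the second pass hits every time; a direct-mapped cache of the same size would keep evicting them.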

Replacement Policies • When an entry must be evicted from the cache, we need to decide which Way to evict it from • To do this we use a replacement policy: • LRU • Clock • FIFO

LRU • Keep track of when each entry is accessed • Always evict the Least Recently Used • Implemented using a stack [Figure: stack ordered from MRU to LRU, shown for Access 4 and Access 2]
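
The stack behaviour can be sketched as follows for a single set. This is a software illustration under assumed names (`LRUStack`, `touch`, `victim`) with Ways numbered 0 to 3; it is not the hardware implementation.

```python
class LRUStack:
    """LRU ordering for one set: index 0 is the MRU entry, the last is the LRU."""

    def __init__(self, ways):
        self.stack = list(range(ways))

    def touch(self, way):
        """Move the accessed Way to the top (MRU position) of the stack."""
        self.stack.remove(way)
        self.stack.insert(0, way)

    def victim(self):
        """The Way at the bottom of the stack is the least recently used."""
        return self.stack[-1]

lru = LRUStack(4)
lru.touch(3)
lru.touch(1)
print(lru.stack, lru.victim())
```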

Clock • For each Way we store a Reference Bit (R Bit) • We also store a pointer to the oldest entry (the Hand) • Starting with the Hand, we test and clear each R Bit until we reach one that is 0 [Figure: example R Bits for Ways 0 to 3]
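
A minimal sketch of the Clock sweep for one set. The class and method names are hypothetical, and advancing the Hand past the chosen victim is an assumption (a common variant) that the slide does not state.

```python
class ClockSet:
    """Clock replacement state for one set: one R Bit per Way plus a Hand."""

    def __init__(self, ways):
        self.ref = [0] * ways
        self.hand = 0

    def touch(self, way):
        """Set the R Bit when a Way is accessed."""
        self.ref[way] = 1

    def victim(self):
        """Sweep from the Hand, clearing R Bits, until one is already 0."""
        while self.ref[self.hand] == 1:
            self.ref[self.hand] = 0
            self.hand = (self.hand + 1) % len(self.ref)
        way = self.hand
        # Assumption: the Hand moves on so the new entry is examined last.
        self.hand = (self.hand + 1) % len(self.ref)
        return way

clock = ClockSet(4)
clock.touch(0)
clock.touch(1)
chosen = clock.victim()
print(chosen, clock.ref, clock.hand)
```

The sweep always terminates: if every R Bit is 1, one full rotation clears them all and the Way under the Hand is then chosen.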

System Overview

PolyBlaze L2 Cache • 1-16 Way Set-Associative Cache • LRU or Clock Replacement Policy • 32 or 64 Byte Line Width • 64 Bit Memory Interface • Write Back Cache
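
As a rough illustration of how the line width and memory interface width interact, the sketch below counts the memory transfers needed to fill or write back one line. It assumes each transfer moves one full 64-bit word; the function name is hypothetical.

```python
MEMORY_INTERFACE_BITS = 64
SUPPORTED_LINE_BYTES = (32, 64)

def transfers_per_line(line_bytes):
    """Number of 64-bit transfers needed to move one cache line."""
    if line_bytes not in SUPPORTED_LINE_BYTES:
        raise ValueError(f"line width must be one of {SUPPORTED_LINE_BYTES}")
    return line_bytes * 8 // MEMORY_INTERFACE_BITS

for width in SUPPORTED_LINE_BYTES:
    print(width, "byte line:", transfers_per_line(width), "transfers")
```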

L2 Cache

Reuse Policy • Determines which Way is evicted on Cache Miss • Currently uses LRU Policy

Tag Bank • Contains Tags and Valid Bits • Stored on FPGA using BRAMs • Instantiate one bank for each Way

Control Unit • Finite State Machine for L2 Cache Pipelining • If a request is outstanding from NPI we can service other requests in SRAM

Data Bank • Control interface for off-chip SRAM

SRAM • 32-bit ZBT synchronous SRAM • 1 MB
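
A back-of-the-envelope sketch of the cache geometry this SRAM could support. It assumes the entire 1 MB (taken as 2^20 bytes) holds cache data and that the SRAM is accessed one 32-bit word at a time; the function name and the example configuration are hypothetical.

```python
SRAM_BYTES = 1 << 20     # assumption: 1 MB means 2**20 bytes
SRAM_WORD_BYTES = 4      # 32-bit SRAM data width

def geometry(line_bytes, ways):
    """Lines, sets, and SRAM words per line for a given configuration."""
    lines = SRAM_BYTES // line_bytes
    return {
        "words_per_line": line_bytes // SRAM_WORD_BYTES,
        "lines": lines,
        "sets": lines // ways,
    }

example = geometry(line_bytes=32, ways=4)
print(example)
```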

Methodology • Break the L2 cache into three parts and test them separately, then combine them and test the complete system • SRAM Controller • NPI Interface • L2 Core • Complete L2 Cache

SRAM Controller • Create a wrapper that connects the SRAM controller to the MicroBlaze by an FSL • Write a program that writes data to, and reads it back from, every address in the SRAM • Write all 1’s √ • Write all 0’s √ • Alternate writing all 1’s and all 0’s √ • Write random data √
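
The four test patterns can be sketched in software as below. A plain Python list stands in for the SRAM, the function names are hypothetical, and "alternate" is interpreted here as alternating by address, which is an assumption about the original test program.

```python
import random

WORD_MASK = 0xFFFFFFFF   # the SRAM is 32 bits wide

def make_patterns(num_words, seed=0):
    """Build the expected contents for each of the four tests."""
    rng = random.Random(seed)   # seeded so a failure can be reproduced
    return {
        "all ones": [WORD_MASK] * num_words,
        "all zeros": [0] * num_words,
        "alternating": [WORD_MASK if a % 2 == 0 else 0 for a in range(num_words)],
        "random": [rng.getrandbits(32) for _ in range(num_words)],
    }

def run_test(memory, expected):
    """Write every word, then read every word back; return failing addresses."""
    for addr, value in enumerate(expected):
        memory[addr] = value
    return [addr for addr, value in enumerate(expected) if memory[addr] != value]

memory = [0] * 1024      # stand-in for the SRAM
failures = {name: run_test(memory, data)
            for name, data in make_patterns(len(memory)).items()}
print(failures)
```

Writing the whole memory before reading any of it back is deliberate: it catches addressing faults where one write disturbs another location.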

NPI Interface • Uses a custom FSL width, so we cannot test using MicroBlaze • Create a hardware test bench to read and write data to all addresses • Write all 1’s X • Write all 0’s X • Alternate writing all 1’s and all 0’s X • Write random data X

L2 Core • Simulate the core of the L2 cache in iSim X • Write a test bench that will approximate the responses from the L1/L2 Arbiter, SRAM Controller, and NPI Interface X • The test bench will write to each line multiple times to create a large number of cache misses X
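
One way to generate stimulus that forces misses is to target a single set with more distinct tags than there are Ways. The sketch below only generates such addresses; the helper names and the 8192-set, 32-byte-line, 4-Way geometry are assumptions for illustration, not the actual test bench.

```python
def conflicting_addresses(set_index, num_sets, line_bytes, count):
    """Return `count` byte addresses that share one set but differ in tag."""
    return [(tag * num_sets + set_index) * line_bytes for tag in range(count)]

def set_and_tag(addr, num_sets, line_bytes):
    """Decode a byte address into (set index, tag)."""
    line_addr = addr // line_bytes
    return line_addr % num_sets, line_addr // num_sets

NUM_SETS, LINE_BYTES, WAYS = 8192, 32, 4
# One more tag than there are Ways guarantees at least one eviction in set 3.
stimulus = conflicting_addresses(3, NUM_SETS, LINE_BYTES, WAYS + 1)
print([hex(a) for a in stimulus])
```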

Complete L2 Cache • Combine the L2 Cache with the rest of PolyBlaze X • Write test programs to read and write to various regions of memory X

Current Progress • SRAM Controller and Data Bank: • Designed and Tested • NPI Interface: • Testing and Debugging in Progress • L2 Core: • Testing and Debugging in Progress

Future Work • Add Clock Replacement Policy to L2 Cache • Add a Write Back Buffer to L2 Cache • Migrate System from XUPV5 to a BEE3 so we can create a system with more cores • Modify the L2 Cache into a NUMA system • Add Custom Hardware Accelerators to PolyBlaze

Questions?
