
The Storage Hierarchy



  1. The Storage Hierarchy
  • Registers
  • Cache memory
  • Main memory (RAM)
  • Hard disk
  • Removable media (CD, DVD, etc.)
  • Internet
  From top to bottom: from fast, expensive, and few to slow, cheap, and a lot.

  2. What is Main Memory? Where the data of a currently running program reside. Sometimes called RAM (Random Access Memory): you can load from or store into any main memory location at any time. Much slower than the levels above it in the hierarchy => much cheaper => much bigger.

  3. What Main Memory Looks Like You can think of main memory as a big long 1D array of bytes, numbered by address: 0, 1, 2, 3, …, up to 536,870,911 on a machine with 512 MB of RAM.
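As a minimal sketch (ours, not from the slides), here is how a 2D Fortran array's elements land in that 1D address space. The helper name byte_offset is hypothetical; it assumes 4-byte reals and Fortran's column-major storage order:

      INTEGER FUNCTION byte_offset(i, j, nr)
        ! Byte offset of element a(i,j) from the start of a REAL array
        ! dimensioned a(nr, nc): columns are contiguous (column-major),
        ! and each REAL is assumed to occupy 4 bytes.
        IMPLICIT NONE
        INTEGER, INTENT(IN) :: i, j, nr
        byte_offset = ((j - 1) * nr + (i - 1)) * 4
      END FUNCTION byte_offset

Consecutive values of i are 4 bytes apart, while consecutive values of j are 4*nr bytes apart; this difference drives the cache behavior in the loop examples later in these slides.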

  4. Cache [4]

  5. Cache Caching is a technique based on the memory subsystem of your computer. The main purpose of a cache is to accelerate your computer while keeping its price low: caching lets you get your computing tasks done more rapidly.

  6. Librarian Example Imagine a librarian behind a desk who gives you the books you ask for, fetching each one from a set of stacks in the storeroom. You ask for the book Moby Dick. The librarian goes into the storeroom, gets the book, returns to the counter and gives it to you. Later, you come back to return the book. The librarian takes it back to the storeroom, then waits at the counter for another customer. If the next customer asks for Moby Dick, the librarian has to go back to the storeroom to get the very book he just handled and hand it over again.

  7. Librarian Example Cont. Under this model, the librarian has to make a complete round trip to fetch every book -- even very popular ones that are requested frequently. Is there a way to improve the performance of the librarian?

  8. Librarian Example with Cache Let's give the librarian a backpack that can hold 10 books (in computer terms, the librarian now has a 10-book cache). In this backpack, he will put the books that customers return to him, up to a maximum of 10. The day starts. The librarian's backpack is empty. Our first customer arrives and asks for Moby Dick. The librarian has to go to the storeroom to get the book. He gives it to the customer.

  9. Librarian Example With Cache Later, the customer returns and gives the book back to the librarian. Instead of carrying the book back to the storeroom, the librarian puts it in his backpack (after first checking that the backpack isn't full). Another customer arrives and asks for Moby Dick. Before going to the storeroom, the librarian checks to see whether this title is in his backpack. He finds it! All he has to do is take the book from the backpack and give it to the customer. There's no journey into the storeroom, so the customer is served more efficiently.

  10. Not in Backpack What if the customer asks for a title that is not in the cache (the backpack)? In that case, the librarian is slightly less efficient with a cache than without one, because he takes the time to look for the book in his backpack first. But even in our simple librarian example, the latency (the waiting time) of searching the cache is negligible compared to the time of a walk back to the storeroom: the cache is small (10 books), and the time it takes to notice a miss is only a tiny fraction of the time that a journey to the storeroom takes.

  11. What is Cache? A special kind of memory where data reside that are about to be used or have just been used. Very fast => very expensive => very small (typically 100 to 10,000 times as expensive as RAM per byte). Data in cache can be loaded into or stored from registers at speeds comparable to the speed of performing computations. Data that are not in cache (but that are in Main Memory) take much longer to load or store. Cache is near the CPU: either inside the CPU or on the motherboard that the CPU sits on.

  12. The Relationship Between Main Memory & Cache

  13. RAM is Slow The speed of data transfer between Main Memory and the CPU is much slower than the speed of calculating, so the CPU spends most of its time waiting for data to come in or go out.

  14. From Cache to the CPU Typically, data move between cache and the CPU at speeds relatively near to that of the CPU performing calculations.

  15. Why Have Cache? Cache is much closer to the speed of the CPU, so the CPU doesn’t have to wait nearly as long for stuff that’s already in cache: it can do more operations per second!

  16. Important Facts about Cache Cache technology is the use of a faster but smaller memory type to accelerate a slower but larger memory type. When using a cache, you must check the cache to see if an item is in there. If it is there, it's called a cache hit. If not, it is called a cache miss and the computer must wait for a round trip from the larger, slower memory area. A cache has some maximum size that is much smaller than the larger storage area.

  17. Facts about Cache cont. It is possible to have multiple layers of cache. With our librarian example, the smaller but faster memory type is the backpack, and the storeroom represents the larger and slower memory type. This is a one-level cache. There might be another layer of cache consisting of a shelf that can hold 100 books behind the counter. The librarian can check the backpack, then the shelf and then the storeroom. This would be a two-level cache.

  18. How Cache Works When you request data from a particular address in Main Memory, here’s what happens: • The hardware checks whether the data for that address are already in cache. • If so, it uses them. • Otherwise, it loads from Main Memory the entire cache line that contains the address.
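To make this concrete, here is a minimal sketch (ours, not from the slides) of a direct-mapped cache model with the geometry used in Example 1 below: 32-byte lines and 64 lines. All names are illustrative:

      SUBROUTINE cache_access(address, tags, valid, hits, misses)
        ! Model one access to a direct-mapped cache with 32-byte lines
        ! and 64 lines: each memory block maps to exactly one cache slot.
        IMPLICIT NONE
        INTEGER,INTENT(IN) :: address
        INTEGER,DIMENSION(0:63),INTENT(INOUT) :: tags
        LOGICAL,DIMENSION(0:63),INTENT(INOUT) :: valid
        INTEGER,INTENT(INOUT) :: hits, misses
        INTEGER :: block, line
        block = address / 32          ! which 32-byte block of RAM?
        line  = MOD(block, 64)        ! which cache slot must hold it?
        IF (valid(line) .AND. tags(line) == block) THEN
          hits = hits + 1             ! cache hit: already loaded
        ELSE
          misses = misses + 1         ! cache miss: load the whole line,
          tags(line)  = block         ! evicting whatever was there
          valid(line) = .TRUE.
        END IF
      END SUBROUTINE cache_access

Driving this model with the address streams of the loop examples below reproduces the kinds of miss counts quoted on those slides.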

  19. If It’s in Cache, It’s Also in RAM If a particular memory address is currently in cache, then it’s also in Main Memory (RAM). That is, all of a program’s data are in Main Memory, but some are also in cache.

  20. Cache Use Jargon Cache Hit: the data that the CPU needs right now are already in cache (remember the librarian example). Cache Miss: the data that the CPU needs right now are not currently in cache. If all of your data are small enough to fit in cache, then when you run your program, you’ll get almost all cache hits (except at the very beginning), which means that your performance could be excellent! Sadly, this rarely happens in real life: most problems of scientific or engineering interest are bigger than just a few MB.

  21. Cache Lines A cache line is a small, contiguous region in cache, corresponding to a contiguous region in RAM of the same size, that is loaded all at once. Typical size of cache lines: 32 to 1024 bytes. Cache reuse: the program is much more efficient if all of the data and instructions fit in cache; try to use what’s in cache a lot before using anything that isn’t in cache (e.g., tiling).
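A quick worked example (ours, using the sizes from Example 1 below): with 32-byte cache lines and 4-byte reals, each line holds 32/4 = 8 array elements. A stride-1 sweep over an array therefore misses once per 8 accesses, while a sweep whose stride is 8 elements or more misses on every access: exactly the factor of 8 separating the two versions of Example 1 below.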

  22. Improving Your Cache Hit Rate Many scientific codes use a lot more data than can fit in cache all at once. Therefore, you need to ensure a high cache hit rate even though you’ve got much more data than cache. So, how can you improve your cache hit rate? Use the same solution as in Real Estate: Location, Location, Location!

  23. Cache Example 1 Cache Line Size (CLS) = 32 bytes. Cache Size (CS) = 64 lines. Direct mapping. Each element is 4 bytes long.

      real a(1024, 100)
      …
      do i = 1, 1024
        do j = 1, 100
          a(i,j) = a(i,j) + 1
        end do
      end do

Because Fortran stores arrays in column-major order, varying j in the inner loop strides through memory 1024 elements (4096 bytes) at a time, so every access touches a different cache line. Cache misses = 100 * 1024 = 102,400.

  24. Example 1 using Loop Interchange Interchange i and j:

      real a(1024, 100)
      …
      do j = 1, 100
        do i = 1, 1024
          a(i,j) = a(i,j) + 1
        end do
      end do

This takes advantage of spatial locality to reduce cache misses: the inner loop now walks through consecutive memory locations, so each 8-element cache line is loaded once and used completely. Cache misses = 102,400 / 8 = 12,800, the number of cache lines occupied by the array a.

  25. Tiling

  26. Tiling • Tile: A small rectangular subdomain of a problem domain. Sometimes called a block or a chunk. • Tiling: Breaking the domain into tiles. • Tiling strategy: Operate on each tile to completion, then move to the next tile. • Tile size can be set at runtime, according to what’s best for the machine that you’re running on.

  27. Example 2

      do j = 1, 100
        do i = 1, 4096
          a(i) = a(i) * a(i)
        end do
      end do

Cache misses = (4096/8) * 100 = 51,200. The cache holds only 64 lines * 8 elements = 512 elements, so after accessing elements a(1..512) the cache is full, and every further line loaded evicts (and writes back) a line that will be needed again. Each element a(i) is reused 100 times, so we can apply TILING to reduce the number of cache misses.

  28. Example 2 using Tiling

      do ii = 1, 4096, B
        do j = 1, 100
          do i = ii, min(ii+B-1, 4096)
            a(i) = a(i) * a(i)
          end do
        end do
      end do

The cache can accommodate 512 elements of a. Thus, if we choose B = 512, each tile’s 512 elements (64 cache lines) are loaded once and then reused for all 100 passes over j: cache misses = 4096/8 = 512, a factor of 100 fewer than without tiling.

  29. A Sample Application Matrix-Matrix Multiply Let A, B and C be matrices of sizes nr × nc, nr × nq and nq × nc, respectively. Then the product A = B · C is defined by A(r,c) = B(r,1)*C(1,c) + B(r,2)*C(2,c) + … + B(r,nq)*C(nq,c), for r = 1, …, nr and c = 1, …, nc.

  30. Matrix Multiply: Naïve Version

      SUBROUTINE matrix_matrix_mult_naive (dst, src1, src2, &
     &                                     nr, nc, nq)
        IMPLICIT NONE
        INTEGER,INTENT(IN) :: nr, nc, nq
        REAL,DIMENSION(nr,nc),INTENT(OUT) :: dst
        REAL,DIMENSION(nr,nq),INTENT(IN) :: src1
        REAL,DIMENSION(nq,nc),INTENT(IN) :: src2
        INTEGER :: r, c, q
        DO c = 1, nc
          DO r = 1, nr
            dst(r,c) = 0.0
            DO q = 1, nq
              dst(r,c) = dst(r,c) + src1(r,q) * src2(q,c)
            END DO
          END DO
        END DO
      END SUBROUTINE matrix_matrix_mult_naive

  31. Multiplying Within a Tile

      SUBROUTINE matrix_matrix_mult_tile (dst, src1, src2, nr, nc, nq, &
     &             rstart, rend, cstart, cend, qstart, qend)
        IMPLICIT NONE
        INTEGER,INTENT(IN) :: nr, nc, nq
        REAL,DIMENSION(nr,nc),INTENT(OUT) :: dst
        REAL,DIMENSION(nr,nq),INTENT(IN) :: src1
        REAL,DIMENSION(nq,nc),INTENT(IN) :: src2
        INTEGER,INTENT(IN) :: rstart, rend, cstart, cend, qstart, qend
        INTEGER :: r, c, q
        DO c = cstart, cend
          DO r = rstart, rend
            IF (qstart == 1) dst(r,c) = 0.0
            DO q = qstart, qend
              dst(r,c) = dst(r,c) + src1(r,q) * src2(q,c)
            END DO
          END DO
        END DO
      END SUBROUTINE matrix_matrix_mult_tile
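The slide shows only the per-tile kernel; the outer driver that sweeps it over every tile is implied but not shown. Here is a hypothetical sketch of such a wrapper (the name matrix_matrix_mult_by_tiling and its tile-size arguments are ours, not from the slides):

      SUBROUTINE matrix_matrix_mult_by_tiling (dst, src1, src2, nr, nc, nq, &
     &             rtilesize, ctilesize, qtilesize)
        IMPLICIT NONE
        INTEGER,INTENT(IN) :: nr, nc, nq
        REAL,DIMENSION(nr,nc),INTENT(OUT) :: dst
        REAL,DIMENSION(nr,nq),INTENT(IN) :: src1
        REAL,DIMENSION(nq,nc),INTENT(IN) :: src2
        INTEGER,INTENT(IN) :: rtilesize, ctilesize, qtilesize
        INTEGER :: rstart, cstart, qstart
        ! Operate on each tile to completion, then move to the next tile.
        ! qstart is outermost so that the qstart == 1 tiles, which zero
        ! out dst, run before any other q tiles for every (r,c) block.
        DO qstart = 1, nq, qtilesize
          DO cstart = 1, nc, ctilesize
            DO rstart = 1, nr, rtilesize
              CALL matrix_matrix_mult_tile(dst, src1, src2, nr, nc, nq, &
     &               rstart, MIN(rstart + rtilesize - 1, nr),           &
     &               cstart, MIN(cstart + ctilesize - 1, nc),           &
     &               qstart, MIN(qstart + qtilesize - 1, nq))
            END DO
          END DO
        END DO
      END SUBROUTINE matrix_matrix_mult_by_tiling

Setting each tile size equal to the corresponding problem dimension makes the wrapper call the kernel exactly once, which is how tiling can be “turned off,” as the next slide notes.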

  32. The Advantages of Tiling • It allows your code to exploit data locality better, to get much more cache reuse: your code runs faster! • It’s a relatively modest amount of extra coding (typically a few wrapper functions and some changes to loop bounds). • If you don’t need tiling – because of the hardware, the compiler or the problem size – then you can turn it off by simply setting the tile size equal to the problem size.

  33. Will Tiling Always Work? Tiling WON’T always work. Why? Well, tiling works well when: • the order in which calculations occur doesn’t matter much, AND • there are lots and lots of calculations to do for each memory movement. If either condition is absent, then tiling may not help.
