
“Nahalal: Cache Organization for Chip Multiprocessors”: New LSU Policy


Presentation Transcript


  1. Software Systems Lab, Department of Electrical Engineering, Technion
     “Nahalal: Cache Organization for Chip Multiprocessors”: New LSU Policy
     By: Ido Shayevitz and Yoav Shargil
     Supervisor: Zvika Guz

  2. NAHALAL PROJECT
     This project is based on the article “Nahalal: Cache Organization for Chip Multiprocessors” by Zvika Guz, Idit Keidar, Avinoam Kolodny, and Uri C. Weiser.
     The NAHALAL project deals with a chip-multiprocessor environment.
     Project goal: to provide an appropriate organization of the cache memory for a multiprocessor environment.

  3. Project Steps
     • Reading the article and looking for a suitable idea.
     • Learning the Simics simulator.
     • Installing Simics 3 on the project host.
     • Learning the g-cache module C code and the NAHALAL code.
     • Writing the implementation code of our project.
     • Defining and writing benchmarks.
     • Testing and debugging our model.
     • Running simulations, collecting statistics, and drawing conclusions.
     • Writing the project book, presentation, and website.

  4. NAHALAL ARCHITECTURE
     The NAHALAL architecture defines the memory cache banks of the L2 cache. The basic distinction is between private banks and a public bank.
     [Figure: NAHALAL vs. CIM cache layouts]

  5. NAHALAL ARCHITECTURE
     • Placement policy
     • Replacement policy for the private banks: LRU (a minimal sketch follows below)
     • Replacement policy for the public bank: LRU
     [Figure: NAHALAL layout]
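     For reference, here is a minimal sketch of age-based LRU victim selection, in the spirit of g-cache's gc-lru-repl.c. It is an illustration only; cache_line_t and pick_lru_victim are hypothetical names, not the real g-cache symbols.

         /* Minimal sketch of LRU victim selection within one bank.
            All names are hypothetical, not the real g-cache symbols. */
         typedef struct {
             int      valid;
             unsigned tag;
             unsigned last_access;    /* cycle of the most recent access */
         } cache_line_t;

         static int pick_lru_victim(cache_line_t *set, int assoc)
         {
             int victim = 0;
             for (int i = 0; i < assoc; i++) {
                 if (!set[i].valid)
                     return i;        /* prefer an empty way */
                 if (set[i].last_access < set[victim].last_access)
                     victim = i;
             }
             return victim;           /* oldest access wins eviction */
         }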

  6. LSU Improvement
     • Placement policy
     • Replacement policy for the private banks: LRU
     • Replacement policy for the public bank: LSU
     [Figure: NAHALAL layout]

  7. LSU Implementation, Way 1
     • A shift register with N cells for each CPU.
     • Each cell in the shift register holds a line number.
     • On eviction by CPUi: for each line in CPUi's shift register, check which appears least in the other CPUs' shift registers (sketched below).
     • With a 64-byte line size and a 1 MB public bank, each cell needs 14 bits; if N = 8, the memory overhead is 8 × 8 × 14 = 896 bits.
     • Only the 8 lines in the shift register are candidates for replacement!
     • A line that no CPU accesses gets stuck forever in the public bank.
     • A complicated, slow algorithm in hardware.
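     A rough C sketch of the Way 1 bookkeeping described above; all names are hypothetical. The triple-nested search over all other CPUs' registers is what makes this a long algorithm to realize in hardware.

         #define NUM_CPUS 8
         #define N        8                 /* shift-register depth */

         /* Per-CPU shift register: the last N public-bank lines each CPU
            touched (each cell holds a 14-bit line number). */
         static int shift_reg[NUM_CPUS][N];

         /* Eviction by cpu_i: among cpu_i's N candidate lines, pick the
            one that appears least often in the other CPUs' registers. */
         static int pick_victim_way1(int cpu_i)
         {
             int best_line  = shift_reg[cpu_i][0];
             int best_count = NUM_CPUS * N;  /* above any possible count */
             for (int c = 0; c < N; c++) {
                 int line = shift_reg[cpu_i][c], count = 0;
                 for (int cpu = 0; cpu < NUM_CPUS; cpu++) {
                     if (cpu == cpu_i)
                         continue;
                     for (int k = 0; k < N; k++)
                         if (shift_reg[cpu][k] == line)
                             count++;
                 }
                 if (count < best_count) {
                     best_count = count;
                     best_line  = line;
                 }
             }
             return best_line;
         }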

  8. LSU Implementation, Way 2
     • A shift register with N cells for each line.
     • Each cell in the shift register holds a CPU number.
     • On eviction by CPUi: search all the shift registers for the one that contains CPUi the most.
     • 8 CPUs → 3 bits for each cell in the shift register. A 64-byte line size and a 1 MB public bank → 2^14 lines in the public bank. If N = 8, the overhead is therefore 2^14 × 8 × 3 = 0.375 MB → 37.5% memory overhead.
     • A 37.5% memory overhead can force large distances between the public-bank lines and the CPUs.
     • A 37.5% memory overhead increases power consumption.
     • A complicated, slow algorithm in hardware.

  9. LSU Implementation, Way 3
     • A shift register with N cells for each line.
     • Each cell in the shift register holds a CPU number.
     • On eviction by CPUi: for each shift register, XOR each cell with the ID of CPUi. The shift register whose XOR produces 0 marks the chosen line; if none produces 0, fall back to regular LRU. (A sketch follows below.)
     • To reduce the memory overhead, define N = 4. The overhead is therefore 2^14 × 4 × 3 = 0.1875 MB → 18.75% memory overhead.
     • A simple, fast algorithm in hardware.
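     A minimal C sketch of the Way 3 rule, under one reading of the slide (a victim is a line all of whose N cells match the evicting CPU, i.e. every XOR yields 0). All names are hypothetical, and a real set-associative cache would search only within the relevant set rather than the whole bank.

         #define PUBLIC_LINES (1 << 14)   /* 1 MB bank / 64 B lines */
         #define N 4                      /* cells per line-level register */

         /* Each public-bank line records the IDs of the last N CPUs that
            touched it (3-bit IDs suffice for 8 CPUs). */
         static unsigned char cpu_reg[PUBLIC_LINES][N];

         static int pick_lru_victim_fallback(void); /* hypothetical: plain LRU */

         /* Eviction by cpu_i: a line whose every cell XORs to 0 against
            cpu_i was recently touched only by cpu_i, i.e. it is not
            shared, so it is a safe victim.  If no register matches,
            fall back to regular LRU as the slide says. */
         static int pick_victim_way3(int cpu_i)
         {
             for (int line = 0; line < PUBLIC_LINES; line++) {
                 int all_match = 1;
                 for (int c = 0; c < N; c++)
                     if ((cpu_reg[line][c] ^ cpu_i) != 0) {
                         all_match = 0;
                         break;
                     }
                 if (all_match)
                     return line;
             }
             return pick_lru_victim_fallback();
         }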

  10. Simics Simulator
     • An in-order instruction-set simulator: simulates code on a virtual machine.
     • Each instruction runs in one time step.
     • Collects statistics during simulations.
     • Provides models for all types of HW units and devices.
     • Memory transactions, memory spaces, and the timing-model interface.
     • The g-cache module is the basis for defining the memory hierarchy and the cache modules.
     • Stall time is the key quantity in cache-model simulations (see the sketch below).
     • Round-robin scheduling with the default quantum for multiprocessor simulation.
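     The stall mechanism works through Simics's timing-model interface: the simulator hands every memory transaction to the cache module's operate() callback, and the returned value is the number of cycles to stall that transaction. A rough sketch is below; my_cache_t and the hit/miss helpers are hypothetical, not actual g-cache symbols.

         #include <simics/api.h>

         /* Hypothetical cache object wrapping the Simics conf object. */
         typedef struct {
             conf_object_t obj;
             int hit_latency;
             int miss_penalty;
         } my_cache_t;

         static int  lookup_hits(my_cache_t *c,
                                 generic_transaction_t *t);          /* hypothetical */
         static void fill_from_next_level(my_cache_t *c,
                                          generic_transaction_t *t); /* hypothetical */

         /* Simics calls this for every memory transaction reaching the
            cache; the returned cycle count is the stall applied to it. */
         static cycles_t
         cache_operate(conf_object_t *obj, conf_object_t *space,
                       map_list_t *map, generic_transaction_t *mem_op)
         {
             my_cache_t *cache = (my_cache_t *)obj;

             if (lookup_hits(cache, mem_op))
                 return cache->hit_latency;   /* short stall on a hit */

             fill_from_next_level(cache, mem_op);
             return cache->miss_penalty;      /* long stall on a miss */
         }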

  11. The Implementation
     In the project book ….

  12. Build Our Project
     Using the Simics g-cache Makefile:

     shargil@slab-i05:/home/users/shargil/Simics/simics-3.0.30/x86-linux/lib> gmake clean
     Removing: g-cache
     shargil@slab-i05:/home/users/shargil/Simics/simics-3.0.30/x86-linux/lib> gmake g-cache
     Generating: modules.cache
     === Building module "g-cache" ===
     Creating dependencies: module_id.c
     Creating dependencies: sorter.c
     Creating dependencies: splitter.c
     Creating dependencies: gc.c
     Creating dependencies: gc-interface.c
     Creating dependencies: gc-attributes.c
     Creating dependencies: gc-cyclic-repl.c
     Creating dependencies: gc-lru-repl.c
     Creating dependencies: gc-random-repl.c
     Creating dependencies: gc-common-attributes.c
     Creating dependencies: gc-common.c
     Creating exportmap.elf
     Compiling gc-common.c
     Compiling gc-common-attributes.c
     Compiling gc-random-repl.c
     Compiling gc-lru-repl.c
     Compiling gc-cyclic-repl.c
     Compiling gc-attributes.c
     Compiling gc-interface.c
     Compiling gc.c
     Compiling module_id.c
     Linking g-cache.so
     shargil@slab-i05:/home/users/shargil/Simics/simics-3.0.30/x86-linux/lib>

  13. Simulation Script
     Using a Python script we defined:

  14. Writing Benchmarks
     Writing benchmarks is done in the simulated target's console:

  15. Writing Benchmarks
     • Using threads with the pthread library.
     • Each thread is pinned to a CPU using the sched library (a pinning sketch follows below).
     • Parallel code is written in the benchmark.
     • OS code and pthread code also contribute parallel code.
     • We run each benchmark first without LSU and then with LSU.
     Our goals:
     1. To show that LSU reduces cycles in parallel benchmarks.
     2. To show that the improvement is larger in “LSU-dedicated benchmarks”.
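     A minimal sketch of the pinning step, assuming Linux and glibc's sched_setaffinity(); the benchmark body itself is elided here (it appears on the next slides).

         #define _GNU_SOURCE
         #include <sched.h>
         #include <pthread.h>
         #include <stdio.h>
         #include <stdlib.h>

         /* Pin the calling thread to a single CPU (pid 0 = this thread). */
         static void pin_to_cpu(int cpu)
         {
             cpu_set_t set;
             CPU_ZERO(&set);
             CPU_SET(cpu, &set);
             if (sched_setaffinity(0, sizeof(set), &set) == -1) {
                 perror("sched_setaffinity");
                 exit(1);
             }
         }

         /* Thread body: pin first, then run the benchmark loops. */
         static void *worker(void *arg)
         {
             pin_to_cpu((int)(long)arg);   /* thread i runs on CPU i */
             /* ... benchmark code from the next slides ... */
             return NULL;
         }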

  16. Benchmark A
     Each thread is pinned to a different CPU, reads from the public (shared) array, and then writes to its private array:

         // read shared array
         for (i = 0; i < NUM_OF_ITERATIONS; i++)
             for (j = 0; j < SHARED_ARRAY_SIZE; j++)
                 tmp = SHARED_ARRAY[j];

         // write to designated array
         accum = 0;
         if (thread_num == 1) {
             des_array_ptr = DESIGNATED_ARRAY1;
         } else if (thread_num == 2) {
             des_array_ptr = DESIGNATED_ARRAY2;
         } else {
             printf("Error 0x100: thread_num has unexpected value\n");
             exit(1);
         }
         for (i = 0; i < NUM_OF_ITERATIONS; i++)
             for (j = 0; j < DESIGNATED_ARRAY_SIZE; j++)
                 *(des_array_ptr + j) = 1;
         return NULL;
     }

  17. Benchmark B
     Each thread is pinned to a different CPU and reads the shared array, then only its own half of it, then a second shared array:

         // read the first shared array: moves it into the public bank
         for (i = 0; i < NUM_OF_ITERATIONS; i++)
             for (j = 0; j < SHARED_ARRAY_SIZE; j++)
                 tmp = SHARED_ARRAY1[j];

         // now each thread reads only its half of the shared array
         if (thread_num == 1) {
             for (i = 0; i < NUM_OF_ITERATIONS; i++)
                 for (j = 0; j < SHARED_ARRAY_SIZE / 2; j++)
                     tmp = SHARED_ARRAY1[j];
         } else if (thread_num == 2) {
             for (i = 0; i < NUM_OF_ITERATIONS; i++)
                 for (j = SHARED_ARRAY_SIZE / 2; j < SHARED_ARRAY_SIZE; j++)
                     tmp = SHARED_ARRAY1[j];
         }

         // read the second shared array
         for (i = 0; i < NUM_OF_ITERATIONS; i++)
             for (j = 0; j < SHARED_ARRAY_SIZE; j++)
                 tmp = SHARED_ARRAY2[j];
         return NULL;
     }

  18. Collecting Statistics Example
     Cache statistics: l2c
     -----------------
     Total number of transactions:       610349
     Total memory stall time:          31402835
     Total memory hit stall time:      28251635
     Device data reads (DMA):                 0
     Device data writes (DMA):                0
     Uncacheable data reads:                 17
     Uncacheable data writes:             30738
     Uncacheable instruction fetches:         0
     Data read transactions:             403488
     Total read stall time:            17488735
     Total read hit stall time:        14383135
     Data read remote hits:                   0
     Data read misses:                    10352
     Data read hit ratio:                97.43%
     Instruction fetch transactions:          0
     Instruction fetch misses:                0
     Data write transactions:            176106
     Total write stall time:            4687600
     Total write hit stall time:        4687600
     Data write remote hits:                  0
     Data write misses:                       0
     Data write hit ratio:              100.00%
     Copy back transactions:                  0
     Number of replacements in the middle (NAHALAL): 557
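     From these counters, the average stall time per transaction is the total memory stall time divided by the number of transactions: 31,402,835 / 610,349 ≈ 51.5 cycles. This is the metric compared with and without LSU on the next slide.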

  19. Results
     • An improvement of 54% in average stall time per transaction in benchmark A, and a 61% improvement in benchmark B.
     • In benchmark A, 8.375% of the transactions cause a replacement in the middle without LSU; with LSU, only 0.09%! ∆ = 8.28%
     • In benchmark B, 8.75% of the transactions cause a replacement in the middle without LSU; with LSU, only 0.02%! ∆ = 8.73%

  20. Conclusions
     • The LSU policy significantly improves the average stall time per transaction. Therefore: the LSU policy implemented in the NAHALAL architecture significantly reduces the number of cycles a benchmark takes.
     • The LSU policy significantly reduces the number of replacements in the middle. Therefore: the LSU policy implemented in the NAHALAL architecture better keeps the hot shared lines in the public bank.
     • In our implementation, LRU is activated if LSU does not find a line. Therefore: the LSU policy as we implemented it is always preferable to LRU.
