
Project : Phase 1 Grading - PowerPoint PPT Presentation





PowerPoint Slideshow about 'Project : Phase 1 Grading' - lael


Presentation Transcript
Project: Phase 1 Grading

  • Default Statistics (40 points)

    • Values and Charts (30 points)

    • Analyses (10 points)

  • Branch Predictor Statistics (30 points)

    • Values and Charts (25 points)

    • Analyses (5 points)

  • L2 Cache Replacement Statistics (30 points)

    • Values and Charts (30 points)


Default Statistics: Analyses

  • CPI is affected by

    • Percentage of branches and how predictable they are

    • Cache hit rates

    • Parallelism inherent in the programs

  • CPI of cc and go is higher than the others

    • Larger percentage of hard-to-predict branches

    • cc: 17% branches, about 12% of which are mispredicted

    • go: 13% branches, about 20% of which are mispredicted

  • CPI of cc is higher than go

    • L1 miss rate of cc (2.6%) is higher than that of go (0.6%)
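
The branch figures above can be combined into mispredictions per instruction. The sketch below assumes the branch percentages are fractions of all dynamic instructions (the slide does not say so explicitly); the function name is illustrative only. Under that assumption go suffers more mispredictions per instruction than cc, which fits the slide's point that cc's higher CPI comes from its L1 miss rate rather than from branches.

```python
def mispredicts_per_instruction(branch_fraction, mispredict_rate):
    """Fraction of all instructions that are mispredicted branches.

    Assumes branch_fraction is measured against all dynamic instructions.
    """
    return branch_fraction * mispredict_rate


# Figures quoted on the slide.
cc = mispredicts_per_instruction(0.17, 0.12)  # about 2.0% of instructions
go = mispredicts_per_instruction(0.13, 0.20)  # about 2.6% of instructions

print(f"cc: {cc:.4f}  go: {go:.4f}")
```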


Default Statistics: Analyses

  • compress has high miss rates

    • Short execution run, so compulsory misses dominate

  • L2 miss rate of anagram is high

    • Very few L2 accesses, so compulsory misses dominate

  • Program-based analyses

    • gcc has a lot of branches

    • go has a small memory footprint

    • anagram is a simple program

    • compress: the input file is only 20 bytes

    • Note: all are integer programs

  • CPI < 1 is possible because the processor is multiple-issue and out-of-order


Branch Predictor: Statistics

  • Perfect > bimodal > taken = not-taken

  • Variation across benchmarks (2 points)

    • go and cc show the greatest variation

    • They have a significant number of hard-to-predict branches


L2 replacement policies

  • No great change in miss rate or CPI

    • 30 points for the values and plots

    • The L1 cache was big, so there were very few L2 accesses

    • The associativity of the L2 cache was small

  • LRU > FIFO > Random


Distribution

  • 90 – 100


Phase 2: Profile-guided optimization

  • Profiling Run

    • Run unoptimized code with sample inputs

    • Instrument the code to collect information about the run

      • Call-graph frequencies

      • Basic-block frequencies

  • Recompile

    • Use the collected information to produce better code

      • Inlining

      • Place hot code together to improve instruction-cache (I$) behavior


Phase 2: Compiler branch hints

  • if (error) { }   // not-taken

  • The compiler provides “hints” about whether each branch is taken or not-taken, using profile information

  • In this question

    • Learn to use simulator as a profiler

    • Learn to estimate benefits of optimizations.


Example

  • Simple loop

    • 1000: …

    • 1004: …

    • 1008: jz 1020 // mostly not taken

    • 1012: jmp 1000

  • For each branch, mark taken or not-taken

    • Taken > 50%

      • Mark taken

    • Not-taken > 50%

      • Mark Not-taken

    • In the above example

      • 1008: not-taken

      • 1032: not-taken

      • 1064: taken


Profiling Run

  • For each static branch instruction

    • Collect execution frequency

    • Collect the percentage taken/not-taken

  • Modify the bpred_update function in bpred.c

    • Maintain a data structure with an entry for each branch instruction, indexed by instruction address

    • Record the execution frequency and taken information in each entry

    • Dump this information at the end of the run
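
The bookkeeping described above can be modeled outside the simulator. The following is a minimal standalone Python sketch, not SimpleScalar code: the class `BranchProfile` and its methods are hypothetical names, and in the real assignment the equivalent update would be made from inside `bpred_update` in bpred.c. It keeps one entry per static branch, indexed by instruction address, and dumps the counts at the end.

```python
from collections import defaultdict


class BranchProfile:
    """Per-branch counters indexed by branch instruction address (illustrative)."""

    def __init__(self):
        # address -> [times executed, times taken]
        self.stats = defaultdict(lambda: [0, 0])

    def update(self, addr, taken):
        """Record one dynamic execution of the branch at addr."""
        entry = self.stats[addr]
        entry[0] += 1
        entry[1] += int(taken)

    def dump(self):
        """Return one 'addr executions taken percent_taken' line per branch."""
        lines = []
        for addr in sorted(self.stats):
            execs, taken = self.stats[addr]
            lines.append(f"{addr} {execs} {taken} {100.0 * taken / execs:.1f}")
        return lines


# Toy run modeled on the loop example: 1008 is rarely taken, 1012 always is.
profile = BranchProfile()
for i in range(10):
    profile.update(1008, taken=(i == 9))
    profile.update(1012, taken=True)

for line in profile.dump():
    print(line)
```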


Analysis

  • From the information collected

    • If a branch is taken more than 50% of the time, mark it taken

      • Otherwise mark it not-taken

    • Remember the instruction addresses and the hints
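
The marking rule above is small enough to sketch directly. This is an illustrative Python function; the profile layout (address mapped to an executions/taken pair) and the name `derive_hints` are assumptions, not part of the assignment. A branch taken exactly 50% of the time falls under "otherwise" and is marked not-taken.

```python
def derive_hints(profile):
    """Map each branch address to True (hint taken) or False (hint not-taken).

    profile: dict of address -> (executions, taken_count).
    """
    return {
        addr: taken * 2 > execs  # strictly more than 50% taken
        for addr, (execs, taken) in profile.items()
    }


# Hypothetical profile data for three branches.
sample_profile = {1008: (10, 1), 1012: (10, 10), 1016: (4, 2)}
hints = derive_hints(sample_profile)
print(hints)
```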


Performance Estimation

  • For all branches

    • Predict taken/not-taken according to the hint

    • You may want to load all the hints into a data structure at the start.

      • Data structure similar to one used for profiling.

      • Indexed by branch instruction address.

  • Estimate new CPI

  • Notes:

    • Sufficient to do this for cc and anagram.

    • After modifying SimpleScalar, you need to run “make”!
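
Before rerunning the simulator, the effect of the hints can be sanity-checked from the profile alone. The sketch below is a standalone Python model with hypothetical names and data layouts; it only counts mispredictions that a static hint-based predictor would make on the profiled run. The new CPI itself would still come from the simulator run with the hint-based predictor.

```python
def static_hint_mispredictions(profile, hints, default_taken=False):
    """Count mispredictions if every branch follows its static hint.

    profile: address -> (executions, taken_count)
    hints:   address -> True (predict taken) / False (predict not-taken)
    Branches without a hint use default_taken (an assumption of this sketch).
    Returns (mispredictions, total_branch_executions).
    """
    wrong = total = 0
    for addr, (execs, taken) in profile.items():
        predict_taken = hints.get(addr, default_taken)
        total += execs
        # Predicting taken is wrong on every not-taken execution, and vice versa.
        wrong += (execs - taken) if predict_taken else taken
    return wrong, total


sample_profile = {1008: (10, 1), 1012: (10, 10)}
sample_hints = {1008: False, 1012: True}
wrong, total = static_hint_mispredictions(sample_profile, sample_hints)
print(f"{wrong} of {total} branch executions mispredicted")
```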


Phase 2: L2 replacement policy

  • LRU policy

    • Works well

    • HW complexity is high

      • Needs status bits to track when each block in a set was last accessed

      • The number of bits increases with associativity

  • PLRU

    • Pseudo LRU policies

    • Simpler replacement policy that attempts to mimic LRU.
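
To make the hardware-cost argument concrete, the sketch below compares per-set status bits for three schemes. The function names are illustrative. The first is the information-theoretic minimum for exact LRU (enough bits to encode one of n! recency orderings), the second is a common counter-based LRU design with one age counter per way, and the third is tree-based pseudo-LRU, which uses n - 1 bits. The counter formula assumes the number of ways is a power of two.

```python
import math


def lru_min_bits(ways):
    """Fewest bits that can distinguish all ways! recency orderings."""
    return (math.factorial(ways) - 1).bit_length()


def lru_counter_bits(ways):
    """One log2(ways)-bit age counter per way (ways is a power of two)."""
    return ways * (ways - 1).bit_length()


def tree_plru_bits(ways):
    """Tree-based pseudo-LRU keeps ways - 1 decision bits per set."""
    return ways - 1


for ways in (4, 16):
    print(ways, lru_min_bits(ways), lru_counter_bits(ways), tree_plru_bits(ways))
```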


Tree based PLRU policy

  • For an n-way cache, there are n - 1 binary decision bits

  • Let us consider a 4-way set-associative cache

    • L0, L1, L2 and L3 are the blocks in the set

    • B0, B1 and B2 are decision bits
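
A minimal Python sketch of a tree-based PLRU set follows. The bit convention is an assumption, since the transcript does not include the slide's figure: each decision bit points toward the half of the subtree to search for a victim (0 = left, 1 = right), an access flips the bits on its path to point away from the accessed block, and the bits are stored in heap order so that `bits[0]`, `bits[1]`, `bits[2]` play the roles of B0, B1, B2 for the 4-way case. The class name `TreePLRU` is hypothetical.

```python
class TreePLRU:
    """Tree-based pseudo-LRU state for one cache set (illustrative sketch)."""

    def __init__(self, ways):
        if ways < 2 or ways & (ways - 1):
            raise ValueError("ways must be a power of two >= 2")
        self.ways = ways
        # Heap layout: node i has children 2i+1 (left) and 2i+2 (right).
        self.bits = [0] * (ways - 1)

    def touch(self, way):
        """Record an access: point every bit on the path away from `way`."""
        node, lo, hi = 0, 0, self.ways
        while hi - lo > 1:
            mid = (lo + hi) // 2
            if way < mid:
                self.bits[node] = 1  # accessed left half, so victim is on the right
                node, hi = 2 * node + 1, mid
            else:
                self.bits[node] = 0  # accessed right half, so victim is on the left
                node, lo = 2 * node + 2, mid

    def victim(self):
        """Follow the decision bits from the root to the block to replace."""
        node, lo, hi = 0, 0, self.ways
        while hi - lo > 1:
            mid = (lo + hi) // 2
            if self.bits[node] == 0:
                node, hi = 2 * node + 1, mid
            else:
                node, lo = 2 * node + 2, mid
        return lo


# 4-way set with blocks L0..L3: touch every block, then touch L0 again.
plru = TreePLRU(4)
for way in (0, 1, 2, 3):
    plru.touch(way)
victim_after_fill = plru.victim()
plru.touch(0)
victim_after_reuse = plru.victim()
print(victim_after_fill, victim_after_reuse)
```

In this run the first victim matches true LRU (L0), while after re-touching L0 the tree selects L2 even though L1 is the least recently used block. That gap is the approximation PLRU accepts in exchange for fewer status bits.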



Notes

  • Use a 4 KB direct-mapped L1 cache

    • Hopefully this should lead to L2 accesses!

  • Use a 16-way 256 KB L2 cache

    • Hopefully enough ways to make a difference!

  • Compare PLRU with LRU, FIFO and Random

  • Sufficient to do this experiment for cc and anagram!
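
For intuition about what such a comparison measures, here is a toy Python model of a single cache set under LRU, FIFO and Random replacement. It is a sketch under simplifying assumptions (one set, a hand-written block trace, a hypothetical `simulate_set` function) and is not a substitute for the SimpleScalar runs; PLRU is left out to keep it short.

```python
import random


def simulate_set(trace, ways, policy, seed=0):
    """Return the number of misses for one cache set.

    trace:  sequence of block identifiers that all map to this set
    policy: "lru", "fifo" or "random"
    """
    rng = random.Random(seed)
    resident = []  # front of the list is the next LRU/FIFO victim
    misses = 0
    for block in trace:
        if block in resident:
            if policy == "lru":
                # A hit refreshes recency under LRU only.
                resident.remove(block)
                resident.append(block)
            continue
        misses += 1
        if len(resident) == ways:
            if policy == "random":
                resident.pop(rng.randrange(ways))
            else:
                resident.pop(0)
        resident.append(block)
    return misses


reuse_trace = [0, 1, 2, 3, 0, 4, 0, 1]
for policy in ("lru", "fifo", "random"):
    print(policy, simulate_set(reuse_trace, ways=4, policy=policy))
```

On this trace LRU keeps the frequently reused block 0 resident while FIFO evicts it, so LRU incurs one fewer miss. A cyclic trace over more blocks than ways makes both miss on every access.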


Perfect Memory Disambiguation

  • Memory Disambiguation

    • Techniques employed by the processor to execute loads/stores out of order

    • Use a HW structure called the load/store queue (LSQ)

      • Tracks addresses / values of loads and stores

  • A load can be issued from the LSQ

    • If there are no prior stores writing to the same address

    • If the address of a prior store is unknown, the load can’t be issued

  • Perfect disambiguation

    • All addresses are “known”
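
The issue rule above can be modeled in a few lines. This is a standalone Python sketch of the rule as stated on the slide, not the logic of `lsq_refresh()`: the `LSQEntry` type and `issuable_loads` function are hypothetical, store-to-load forwarding is ignored, and "perfect" simply means every address is treated as already known.

```python
from dataclasses import dataclass


@dataclass
class LSQEntry:
    is_store: bool
    addr: int         # actual effective address (the simulator always knows it)
    addr_known: bool  # whether the pipeline has computed the address yet


def issuable_loads(lsq, perfect=False):
    """Return indices of loads that may issue; lsq is in program order."""
    ready = []
    unknown_store_seen = False
    prior_store_addrs = set()
    for index, entry in enumerate(lsq):
        known = entry.addr_known or perfect
        if entry.is_store:
            if known:
                prior_store_addrs.add(entry.addr)
            else:
                # An unresolved store might alias any later load.
                unknown_store_seen = True
        elif known and not unknown_store_seen and entry.addr not in prior_store_addrs:
            ready.append(index)
    return ready


queue = [
    LSQEntry(is_store=True, addr=0x100, addr_known=False),
    LSQEntry(is_store=False, addr=0x200, addr_known=True),
    LSQEntry(is_store=True, addr=0x300, addr_known=True),
    LSQEntry(is_store=False, addr=0x300, addr_known=True),
]
print(issuable_loads(queue))                # conventional disambiguation
print(issuable_loads(queue, perfect=True))  # perfect disambiguation
```

With the unresolved store at the head of the queue, no load can issue. With perfect disambiguation the load from 0x200 is free to go, while the load from 0x300 still waits behind the store to the same address.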


How are addresses known?

  • Two ways to do this:

    • Trace based: run once, then collect and remember all the addresses

    • All register values are actually “known” to the simulator through functional simulation

      • Even though a register is “yet to be computed”, the simulator knows the value

  • Look at the lsq_refresh() function in sim-outorder.c

  • To give you the flexibility to do it either way

    • Simulate only a million instructions

    • Fast-forward 100 million instructions


Memory Disambiguation

  • Compare CPI with and without perfect disambiguation

  • Sufficient to do this for cc and go

  • Fast-forward 100 million instructions with -fastfwd

  • Simulate for an additional 1 million instructions