Computer Architecture

Computer Architecture Lab 3
Prof. Jerry Breecher
CS 240 Fall 2003

What you will do in this lab.
The purpose of this lab is to let you use some of the concepts you’ve acquired about Pipeline Hurdles. For this lab you will be producing and using structural and data hazards; you will see what they are and how they can affect the performance of a computer. You have five tasks before you:
• Produce a program that works with structural hazards. This program uses a floating point divide and examines the structural hazards inherent in that instruction.
• The second task challenges you to write code that will move data as quickly as possible. You can match wits with the processor in a challenge to produce the fastest operating code.
• The third task asks you to examine the effects of READ AFTER WRITE data hazards.
• Determine the penalty that results from a Branch Misprediction.
• Read the Intel manual and examine methods of improving performance.

What you will do in this lab.
What is a verbal lab? You prepare, document and tie up all the pieces of your lab just as if you were handing it in. Instead of handing it in, you and your teammate talk over the results with me. The discussion will be professional, the way I would talk about a problem with a junior colleague.

Where To Get Documentation
There is an absolutely stupendous manual devoted to the Pentium 4 architecture (which is what we have in the lab):
“Pentium 4 and Xeon Processor Optimization”
http://developer.intel.com/design/pentium4/manuals/248966.htm
Local copy at: http://babbage.clarku.edu/~jbreecher/docs/Intel_Arch_Optimization.pdf
In this manual, Chapter 1 contains lots of great information about the pipeline used in the processors. Chapter 2 contains guidelines for optimizing performance. There are also excellent coding examples throughout.

Sample Code
Here are some very simple pieces of C code to help you get started on this lab.
    /* This is code for Task 1. */
    double c;                       /* global, so the result is stored to memory */
    int main(void)
    {
        double a = 3, b = 7;
        int i;
        for ( i = 0; i < 10000000; i++ )
            c = a / b;
        return 0;
    }
    /* This is code for Task 2 (a separate program). */
    #define BYTES 4096
    char A[BYTES];
    char B[BYTES];
    int main(void)
    {
        int i;
        for ( i = 0; i < BYTES; i++ )
            A[i] = B[i];            /* copies one byte per iteration */
        return 0;
    }
About optimization level -O3: Using optimization MAY remove code from inside a loop. In the loop for Task 1, notice that there’s no reason to do the divide 10 million times! It could simply be calculated once outside the loop. Sometimes the compiler is smart enough to know that and remove the loop entirely! Optimizations can obviously make a big difference in timings. Because an optimized loop takes less time, any change in the loop will be more noticeable.

Task 1:
The floating point divide instruction requires many cycles to complete. We’re interested in what goes on in the processor for this instruction. Write a floating point divide in Assembler or C, or start from a .s file produced by gcc. It might be easiest to write a simple C program in order to get the basic structure. An example is given on the sample code page.
• How many cycles does the instruction take?
• How many “nop” instructions can you place after the divide without causing the cycle time of the code loop to increase? Do these instructions execute even though the floating point instruction has not completed?
• How long after the start of the divide is the result of the instruction available?
• What else can you find out about a divide? For instance, is it possible to do two floating point instructions at the same time?

Task 2:
The time to move a large block of data is very dependent on the way the code is written. This can be interpreted as a structural hazard in that there’s only one memory pipe - there’s only one way to get data from the memory/cache into registers. Alternatively this can be seen as a data hazard - the result of the load instruction needs to be provided to subsequent instructions. Whatever your interpretation, your task is to write the best (meaning fastest) possible code to move data.
• Write code in Assembler or C that will move 4096 bytes of data from one location in memory to another. The action is defined in the program on the sample code page. Note that this sample program will NOT produce high-performing code.
• Note that the kind of code generated with gcc -S -O3 xx.c is very different from that generated with gcc -S xx.c. What do the -On flags do for you? They should be very useful in this and later exercises.
• You might want to look at Appendix E of the manual for some really spiffy ways of doing this assignment. Whether you choose to use this method or not is up to you.

Task 2:
• Now optimize the code so as to minimize the time to move the data. How many microseconds does it take to move 4096 bytes? How many cycles?
• Explain where the clock cycles go. Why, with the instructions in your loop, do you get the number of cycles you do? This is an accounting question.

Task 3
We’ve learned in class that a Read After Write (RAW) stall occurs when an instruction produces (writes) a register that is used by (read by) a later instruction. In this section, we will examine this behavior in more detail on an Intel processor.
1. Write assembler code with a RAW dependence.
2. What is the delay (in cycles) resulting from this RAW? To answer this question, you will write code with and without the dependence and measure the difference in timing.
3. Read about the Intel pipeline and justify your result. You learned in class about pipelines, but all the examples there used the MIPS pipeline. Your challenge now is to transfer your classroom experience as well as your lab results into an understanding of the Intel pipeline.

Task 4 - Introduction
The purpose of this task is to determine the cost of branches. What does it cost when the processor successfully handles the branch based on its previous experience, and what does it cost when it has no history, or an inaccurate history, that forces it to figure out the branch from scratch? As we’ve learned in class, there are branch caches and other aids that help the processor by maintaining previous history. Pages 1-14 and 2-13 of the Processor Optimization Manual describe the implementation of Intel branching. Your goal in simple terms is this: fill in this table, which describes the number of cycles lost due to various branch circumstances.

Task 4 - Introduction
In order to accomplish this task you’ll need at least three tools.
• One or more programs that will generate and time branches. These programs will be able to do branches from within a loop so that you can easily determine how many have been done. They must also be able to fool the processor so that it mispredicts the result of a branch. (I’m giving you a starting program -- branch.c -- to do that.)
• A processor that’s instrumented to count the number of branches executed (retired in Intel lingo), along with various other information, such as mispredicted, taken/not taken, and so on. Fortunately the Intel and AMD processors we have are capable of doing this.
• A program that can extract this branch information from the processor.

Task 4 - branch.c
The program you will use to generate branches, both predicted and not predicted, is called branch.c. This program is available with the path ~jbreecher/public/arch/labs/branch.c
I’ve found that compiling it unoptimized makes the code much more readable; in fact a clever compiler can eliminate many of the branches entirely.
    gcc -g branch.c -o branch
The program as described here has a total of 4 branches:
    1 conditional branch that is almost always taken
    1 conditional branch that may be taken or not (this is the interesting one!)
    2 unconditional branches that are NEVER mispredicted but may have additional costs associated with them simply because they are branches.
branch.c can be called with three possible options:
0. The processor successfully predicts the branch is NOT taken
1. The processor successfully predicts the branch IS taken
2. The processor cannot predict if the branch is taken or not.

Task 4 - branch.c
Here’s the essential code for doing the branch misprediction. It’s a program that fools the hardware by sometimes taking a branch and sometimes not taking it.
    /* Prepare the array that guides how the branch will occur.
       branch_type is 0, 1, or 2 */
    for ( index = 0; index < NUM_DIR; index++ )
    {
        if ( branch_type == 0 ) direction[index] = 0;
        if ( branch_type == 1 ) direction[index] = 1;
        if ( branch_type == 2 ) direction[index] = rand() % 2;   /* Note random number here */
    }
    get_current_time( &start_time );                            /* Start Timing */
    for ( iters = 0; iters < iterations; iters++ )
    {
        for ( index = 0; index < NUM_DIR; index++ )
        {
            if ( direction[index] == 0 )                        /* Here's the interesting branch */
                global = global + 3;
            else
                global = global + 2;
        } /* End of for index */
    } /* End of for iters */
    get_current_time( &end_time );                              /* Stop Timing */

Task 4 - branch.c
From the .s file. This is debuggable (not optimized) code.
            addl $16,%esp               # Set up outer loop
            movl $0,-16(%ebp)
    .L11:
            movl -16(%ebp),%eax         # this “if” is part of outer loop
            cmpl -8(%ebp),%eax
            jb .L14
            jmp .L12
    .L14:
            movl $0,-12(%ebp)           # Set up inner loop
    .L15:
            cmpl $4095,-12(%ebp)        # this “if” is part of inner loop
            jle .L18                    # BRANCH Number 1
            jmp .L13
    .L18:
            leal -4112(%ebp),%eax       # if part of branch
            movl -12(%ebp),%edx
            cmpb $0,(%edx,%eax)
            jne .L19                    # BRANCH Number 2 The INTERESTING one.
            addl $3,global              # Add 3 to global
            jmp .L17                    # BRANCH Number 3 - unconditional
    .L19:
            addl $2,global              # Add 2 to global
    .L20:
    .L17:
            incl -12(%ebp)              # Increment inner loop counter
            jmp .L15                    # BRANCH Number 4 - unconditional
            .p2align 4,,7
    .L16:
    .L13:
            incl -16(%ebp)              # Increment outer loop counter
            jmp .L11
    .L12:

Task 4 - branch.c
From disassemble. This is debuggable (not optimized) code, the same code as on the last page.
    0x804861f <main+335>: add $0x10,%esp
    0x8048622 <main+338>: movl $0x0,0xfffffff0(%ebp)
    0x8048629 <main+345>: lea 0x0(%esi,1),%esi          <- This is a dummy
    0x8048630 <main+352>: mov 0xfffffff0(%ebp),%eax
    0x8048633 <main+355>: cmp 0xfffffff8(%ebp),%eax
    0x8048636 <main+358>: jb 0x8048640 <main+368>
    0x8048638 <main+360>: jmp 0x8048685 <main+437>      <- Done with loop
    0x804863a <main+362>: lea 0x0(%esi),%esi
    0x8048640 <main+368>: movl $0x0,0xfffffff4(%ebp)
    0x8048647 <main+375>: cmpl $0xfff,0xfffffff4(%ebp)  <- Compare with 4095 (0xfff)
    0x804864e <main+382>: jle 0x8048652 <main+386>
    0x8048650 <main+384>: jmp 0x8048680 <main+432>
    0x8048652 <main+386>: lea 0xffffeff0(%ebp),%eax
    0x8048658 <main+392>: mov 0xfffffff4(%ebp),%edx
    0x804865b <main+395>: cmpb $0x0,(%edx,%eax,1)
    0x804865f <main+399>: jne 0x8048670 <main+416>
    0x8048661 <main+401>: addl $0x3,0x8049990           <- Add 3 to global
    0x8048668 <main+408>: jmp 0x8048677 <main+423>
    0x804866a <main+410>: lea 0x0(%esi),%esi
    0x8048670 <main+416>: addl $0x2,0x8049990           <- Add 2 to global
    0x8048677 <main+423>: incl 0xfffffff4(%ebp)
    0x804867a <main+426>: jmp 0x8048647 <main+375>
    0x804867c <main+428>: lea 0x0(%esi,1),%esi
    0x8048680 <main+432>: incl 0xfffffff0(%ebp)
    0x8048683 <main+435>: jmp 0x8048630 <main+352>

Task 4 - branch.c
Here’s the initial analysis of this program. It contains information about the 4 branches and what action is taken in each of the three possible modes. Think of this as a set of simultaneous equations: you have a certain number of unknowns along with measured numbers. Your goal is to solve the equations.

Task 4 - Processor Instrumentation
The Intel and AMD processors we have contain counters that can be activated. Here are some simple factoids about them:
• You can activate and gather information on only two counters at a time.
• Here’s how they work: you tell the processor “Interrupt the OS when N instances of this operation have occurred.” So, suppose I want to gather how many instructions have completed. I set the counter with Type = INST_RETIRED and Count = 100000. Then every Count instructions, the processor interrupts. The OS records how many interrupts have occurred.
• The number of interrupts that occurred is what is reported to you. To get the total number of actions that occurred, you need to multiply Count * Interrupts.
• Here are some interesting counters that you will want to understand. These are explained in Intel Processor Manual #3 - Appendix A.
    INST_RETIRED
    UOPS_RETIRED
    BR_INST_RETIRED
    BR_MISS_PRED_RETIRED
    BR_TAKEN_RETIRED
    BR_MISS_PRED_TAKEN_RET
    BTB_MISSES
    BR_INST_DECODED

Task 4 - Oprofile
Note: when running any timing program, you want to ensure you are the only person on the machine.
Oprofile is the tool that you will use to gather counter information. It consists of several components: a user level interface, including programs to interpret and analyze user source code, and a Linux driver that talks to the processor chip. For this lab, we will be using only very simple modes of Oprofile. We’ll use more advanced features in future labs. You can read more about Oprofile in the document ~jbreecher/public/docs/Oprofile.html
After considerable trial and error, I have determined that the following set of commands is one way that produces results (the documentation is certainly not adequate!)
    rm ~/.oprofile/daemonrc
    opcontrol --init
    opcontrol --reset
    opcontrol --no-vmlinux
    opcontrol --event=INST_RETIRED:100000 --event=UOPS_RETIRED:100000
    opcontrol --start
    sleep 30
    opcontrol --shutdown
    opreport event:INST_RETIRED | grep branch
    opreport event:UOPS_RETIRED | grep branch
    opcontrol --reset
To run these programs you must be root. You should be able to say “sudo opcontrol”, etc. Talk with Jerome.
These are 2 of the 8 counters you will want to measure. Run this in a macro after first starting up the branch program to run “forever” as a background task. Augment these commands to include the other processor counters.

Task 4 - A Sample Calculation
When running the branch program I got the following results:
    napier% branch 0 100000
    Inputs: Branch Type: 0 Iterations: 100,000
    Elapsed seconds is 3.473632 for 409,600,000 Loops
    Counted INST_RETIRED events (number of branch instructions retired) with a unit mask of 0x00 (No unit mask) count 100000
    396739 98.9524 branch
Calculation from the program’s own timing:
    Instructions/Sec = 409,600,000 Loops * 10 Inst/Loop / 3.47 Sec = 1.18 Billion Inst/Sec
Calculation from the processor counter:
    Instructions/Sec = 396,739 Interrupts * 100,000 Inst/Interrupt / 30.00 Sec = 1.32 Billion Inst/Sec
The calculations should give the same number, but life isn’t always so simple.
Calculation for Cycles/Loop:
    Loops/Second = 1.32 Billion Inst/Sec / 10 Inst/Loop = 132 Million Loops/Sec
    Cycles/Second = 1 GHz = 1 Billion/Sec
    Cycles/Loop = 1 Billion/Sec / 132 Million/Sec = 7.6

Task 4 - A Sample Calculation
The result of the calculation on the last page lets us fill in a number in our table. In this fashion, you can determine the cost of branches, both predicted and mispredicted. I was able to fill in the table by using the processor counters to figure out how many of the various kinds of branches occurred in each loop.

Task 5
The Intel Optimization Manual contains a wealth of information: Chapter 1 talks about the Intel processor pipeline. Chapter 2 gives numerous optimization hints for the Intel processor.
• Read these chapters and devise your own performance experiment involving structural or data hazards. (The manual talks about other types of performance improvements, but we’ll save that for later labs.)
• Report your results and how the performance changes as a result of your experiments.
Here’s an example: On the top of Page 1-5, in the section on decoders, there’s a discussion about arranging assembler instructions based on the number of μops in those instructions. Your solution could be to write two programs. One program follows the guidelines on pg. 1-5. One program purposely violates the guidelines. What is the performance difference between the two programs?
