Lec 3 Sept 2 complete Chapter 1 exercises from Chapter 1 quiz # 1 Chapter 2 st


Lec 3 Sept 2

• complete Chapter 1
• exercises from Chapter 1
• quiz # 1
• Chapter 2 start
Performance Summary
• Performance depends on
• Algorithm: affects IC, possibly CPI
• Programming language: affects IC, CPI
• Compiler: affects IC, CPI
• Instruction set architecture: affects IC, CPI, Tc

The BIG Picture

Exercise 1.2.1

For a color display using 8 bits for each primary color (R, G, B) per pixel and with a resolution of 1280 x 800 pixels, what should be the size (in bytes) of the frame buffer to store a frame?

Each pixel takes 3 bytes (one per primary color), so each frame requires 1280 x 800 x 3 = 3,072,000 bytes ≈ 3 MB.

If a computer has 3 GB memory to store such frames, how many frames can be stored?

$(3 \times 10^9) / (3 \times 10^6) \approx 1000$ frames
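The arithmetic in Exercise 1.2.1 can be checked with a short sketch. The function names are illustrative, and the code assumes 1 GB = 10^9 bytes as the slide does. The exact count comes out slightly below the slide's rounded estimate of 1000 because a frame is a little larger than 3 x 10^6 bytes.

```python
# Frame-buffer sizing for Exercise 1.2.1 (function names are illustrative).

def frame_bytes(width, height, bytes_per_pixel=3):
    """Bytes for one frame: 8 bits each for R, G, B gives 3 bytes per pixel."""
    return width * height * bytes_per_pixel


def frames_that_fit(memory_bytes, width, height, bytes_per_pixel=3):
    """Number of whole frames that fit in the given amount of memory."""
    return memory_bytes // frame_bytes(width, height, bytes_per_pixel)


size = frame_bytes(1280, 800)
count = frames_that_fit(3 * 10**9, 1280, 800)  # assumes 1 GB = 10**9 bytes
print(size, count)
```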

Exercise 1.3

Consider three processors P1, P2 and P3 with the same instruction set, with clock rates and CPI given below:

| Processor | Clock rate | CPI |
|---|---|---|
| P1 | 2 GHz | 1.5 |
| P2 | 1.5 GHz | 1.0 |
| P3 | 3 GHz | 2.5 |


1.3.1. Which processor has the highest performance?

Suppose the program has N instructions.

Time taken to execute on P1 = $1.5N / (2 \times 10^9) = 0.75N \times 10^{-9}$ s

Time taken to execute on P2 = $N / (1.5 \times 10^9) = 0.67N \times 10^{-9}$ s

Time taken to execute on P3 = $2.5N / (3 \times 10^9) = 0.83N \times 10^{-9}$ s


P2 has the best performance (since it takes the least time to execute).
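The comparison for 1.3.1 can be sketched as follows. Since all three processors run the same N instructions, it is enough to compare the time per instruction, CPI / clock rate. The dictionary layout and names are illustrative.

```python
# Exercise 1.3.1: compare time per instruction (CPI / clock rate).
# Clock rates and CPI values are taken from the exercise table.
processors = {
    "P1": (2.0e9, 1.5),
    "P2": (1.5e9, 1.0),
    "P3": (3.0e9, 2.5),
}


def time_per_instruction(clock_hz, cpi):
    """Average seconds per instruction."""
    return cpi / clock_hz


times = {name: time_per_instruction(f, cpi) for name, (f, cpi) in processors.items()}
fastest = min(times, key=times.get)  # the least time means the best performance

for name, t in times.items():
    print(f"{name}: {t * 1e9:.2f} ns per instruction")
print("Fastest:", fastest)
```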


1.3.2. If the processors each execute a program in 10 seconds, find the number of cycles and the number of instructions.


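Time taken to execute on P1 = $1.5 N_1 / (2 \times 10^9) = 0.75 N_1 \times 10^{-9}$ s = 10 s

So $N_1 = 1.33 \times 10^{10}$ instructions, and the number of cycles on P1 is $10 \times 2 \times 10^9 = 2 \times 10^{10}$.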
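The same steps apply to all three processors: cycles = time x clock rate, and instructions = cycles / CPI. A minimal sketch, with illustrative names and the table values from the exercise:

```python
# Exercise 1.3.2: cycles and instruction counts for a 10-second run.
processors = {
    "P1": (2.0e9, 1.5),
    "P2": (1.5e9, 1.0),
    "P3": (3.0e9, 2.5),
}
RUN_TIME_S = 10.0


def cycles_and_instructions(clock_hz, cpi, seconds):
    """Return (clock cycles, instructions executed) for a run of the given length."""
    cycles = seconds * clock_hz
    return cycles, cycles / cpi


results = {
    name: cycles_and_instructions(f, cpi, RUN_TIME_S)
    for name, (f, cpi) in processors.items()
}
for name, (cycles, instructions) in results.items():
    print(f"{name}: {cycles:.3e} cycles, {instructions:.3e} instructions")
```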

Exercise 1.4.3

Given below are the numbers of instructions in each class of a program, followed by the total:

500 50 100 50 700

Assuming the instructions take 1, 5, 5 and 2 cycles, what is the execution time in a 2 GHz processor?


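Solution: time to execute = cycle time x CPI x no. of instructions

Cycle time = $1 / (2 \times 10^9)$ s = 0.5 ns

CPI = $(500 \times 1 + 50 \times 5 + 100 \times 5 + 50 \times 2) / 700 = 1350 / 700 \approx 1.93$

So the total time = $1350 \times 0.5 \times 10^{-9} = 675 \times 10^{-9}$ s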
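The weighted-CPI calculation can be sketched as below. The exercise text here does not name the four instruction classes, so the code simply pairs each count with its cycle cost in the order given.

```python
# Exercise 1.4.3: execution time from an instruction mix on a 2 GHz processor.
counts = [500, 50, 100, 50]      # instructions per class, in the order given
cycles_per_class = [1, 5, 5, 2]  # cycles per instruction for each class
CLOCK_HZ = 2.0e9

total_instructions = sum(counts)
total_cycles = sum(n * c for n, c in zip(counts, cycles_per_class))
average_cpi = total_cycles / total_instructions
exec_time_s = total_cycles / CLOCK_HZ  # same as cycle time x CPI x instruction count

print(total_instructions, total_cycles, round(average_cpi, 2), exec_time_s)
```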

Exercise 1.6

• Compilers have a profound impact on the performance of an application on a given processor. This problem explores the impact compilers have on execution time.

| | Compiler A: no. of instructions | Compiler A: exec. time | Compiler B: no. of instructions | Compiler B: exec. time |
|---|---|---|---|---|
| (a) | $1.0 \times 10^9$ | 1 s | $1.2 \times 10^9$ | 1.4 s |
| (b) | $1.4 \times 10^9$ | 0.8 s | $1.2 \times 10^9$ | 0.7 s |

Find the average CPI for each program given that the processor has a cycle time of 1 ns.


Exec. time = CPI x cycle time x no. of instructions

(a) Compiler A: CPI = $1 / (10^{-9} \times 10^9) = 1$
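The remaining three cases follow from the same rearrangement, CPI = exec. time / (cycle time x no. of instructions). A small sketch, with illustrative names and the values from the exercise table:

```python
# Exercise 1.6: average CPI for each program/compiler pair, cycle time = 1 ns.
CYCLE_TIME_S = 1e-9


def average_cpi(exec_time_s, instruction_count, cycle_time_s=CYCLE_TIME_S):
    """CPI = execution time / (cycle time x instruction count)."""
    return exec_time_s / (cycle_time_s * instruction_count)


# (program, compiler) -> (instruction count, execution time in seconds)
cases = {
    ("a", "A"): (1.0e9, 1.0),
    ("a", "B"): (1.2e9, 1.4),
    ("b", "A"): (1.4e9, 0.8),
    ("b", "B"): (1.2e9, 0.7),
}
cpis = {key: average_cpi(t, n) for key, (n, t) in cases.items()}
for (program, compiler), cpi in cpis.items():
    print(f"({program}) compiler {compiler}: CPI = {cpi:.2f}")
```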

Power Trends

§1.5 The Power Wall

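• In CMOS IC technology: $\text{power} = \text{capacitive load} \times \text{voltage}^2 \times \text{clock rate}$
• Figure annotations: power ×30, voltage 5V → 1V, clock rate ×1000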

Reducing Power
• Suppose a new CPU has
• 85% of capacitive load of old CPU
• 15% voltage and 15% frequency reduction
• The power wall
• We can’t reduce voltage further
• We can’t remove more heat
• How else can we improve performance?
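The effect of the new CPU's changes in the "Reducing Power" example can be estimated with the relation power = capacitive load x voltage^2 x clock rate. The sketch below only computes the ratio of new to old power, so the absolute values of the old CPU are not needed. The function name is illustrative.

```python
# Reducing Power example: ratio of new power to old power.
# Uses power = capacitive load x voltage^2 x clock rate.

def power_ratio(load_scale, voltage_scale, frequency_scale):
    """New power divided by old power, given the scale factor of each term."""
    return load_scale * voltage_scale**2 * frequency_scale


# 85% of the capacitive load, 15% lower voltage, 15% lower frequency
ratio = power_ratio(0.85, 0.85, 0.85)
print(f"New CPU uses about {ratio:.0%} of the old CPU's power")
```

Because voltage enters squared, the three 15% reductions compound to roughly half the original power.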
Exercise 1.7

1.7.4. Given the following information about each processor, calculate its capacitive load:

Processor 80286: clock rate = 12.5 MHz

power = 3.3 W

voltage = 5 V

Solution: Use the equation

$\text{power} = \text{capacitive load} \times \text{voltage}^2 \times \text{clock rate}$

Capacitive load = $3.3 / (5^2 \times 12.5 \times 10^6) = 0.01056 \times 10^{-6}$ F
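The rearranged power equation can be wrapped in a small helper for checking this and similar cases. The function name is illustrative; only the 80286 figures from the exercise are used.

```python
# Exercise 1.7.4: capacitive load from power, voltage and clock rate.

def capacitive_load(power_w, voltage_v, clock_hz):
    """Rearranged from power = capacitive load x voltage^2 x clock rate."""
    return power_w / (voltage_v**2 * clock_hz)


# 80286: 3.3 W, 5 V, 12.5 MHz
load_80286 = capacitive_load(3.3, 5.0, 12.5e6)
print(f"{load_80286:.4e} F")
```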

Uniprocessor Performance

§1.6 The Sea Change: The Switch to Multiprocessors

Constrained by power, instruction-level parallelism, memory latency

Multiprocessors

General-purpose uni-cores have reached limits of historic performance scaling

• Power consumption
• Wire delays
• DRAM access latency
• Diminishing returns of more instruction-level parallelism

Slide from Prof. Saman Amarasinghe

Multiprocessors
• Multicore microprocessors
• More than one processor per chip
• Requires explicitly parallel programming
• Compare with instruction level parallelism
• Hardware executes multiple instructions at once
• Hidden from the programmer
• Hard to do
• Programming for performance
• Optimizing communication and synchronization
Manufacturing ICs
• Yield: proportion of working dies per wafer

§1.7 Real Stuff: The AMD Opteron X4

Integrated Circuit Cost
• Nonlinear relation to area and defect rate
• Wafer cost and area are fixed
• Defect rate determined by manufacturing process
• Die area determined by architecture and circuit design
SPEC CPU Benchmark
• Programs used to measure performance
• Supposedly typical of actual workload
• Standard Performance Evaluation Corp (SPEC)
• Develops benchmarks for CPU, I/O, Web, …
• SPEC CPU2006
• Elapsed time to execute a selection of programs
• Negligible I/O, so focuses on CPU performance
• Normalize relative to reference machine
• Summarize as geometric mean of performance ratios
• CINT2006 (integer) and CFP2006 (floating-point)
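The summary step described above (normalize to a reference machine, then take the geometric mean of the ratios) can be sketched as follows. The benchmark times in the example are hypothetical placeholders, not SPEC data, and the function names are illustrative.

```python
import math

# Geometric mean of performance ratios (ratio = reference time / measured time).

def performance_ratios(reference_times, measured_times):
    """One ratio per benchmark; larger means faster than the reference machine."""
    return [ref / t for ref, t in zip(reference_times, measured_times)]


def geometric_mean(values):
    """n-th root of the product, computed via logs for numerical stability."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical times in seconds for three benchmarks (not real SPEC results)
reference = [1000.0, 2000.0, 4000.0]
measured = [500.0, 250.0, 1000.0]
ratios = performance_ratios(reference, measured)
print(ratios, geometric_mean(ratios))
```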
CINT2006 for Opteron X4 2356

High cache miss rates

Amdahl’s Law

$$s = \frac{1}{f + (1 - f)/p} \le \min(p, 1/f)$$

f = fraction unaffected

p = speedup of the rest

Amdahl’s law: speedup achieved if a fraction f of a task is unaffected and the remaining 1 – f part runs p times as fast.
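A minimal sketch of the formula, using the slide's symbols f (fraction unaffected) and p (speedup of the rest). The function names and the input check are additions for illustration.

```python
# Amdahl's law: f = fraction of the task unaffected, p = speedup of the rest.

def amdahl_speedup(f, p):
    """Overall speedup s = 1 / (f + (1 - f) / p)."""
    if not 0.0 <= f <= 1.0 or p <= 0:
        raise ValueError("need 0 <= f <= 1 and p > 0")
    return 1.0 / (f + (1.0 - f) / p)


def amdahl_bound(f, p):
    """Upper bound min(p, 1/f); when f is 0 the bound is simply p."""
    return min(p, 1.0 / f) if f > 0 else p


s = amdahl_speedup(0.5, 2)
print(s, amdahl_bound(0.5, 2))
```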

Amdahl’s Law in design

Example

• A processor spends 30% of its time on floating-point (flp) addition, 25% on flp multiplication, and 10% on flp division. Evaluate the following enhancements, each costing the same to implement:
• Redesign the flp adder to make it twice as fast.
• Redesign the flp multiplier to make it three times as fast.
• Redesign the flp divider to make it 10 times as fast.
• Solution
• Adder redesign speedup = 1 / [0.7 + 0.3 / 2] = 1.18
• Multiplier redesign speedup = 1 / [0.75 + 0.25 / 3] = 1.20
• Divider redesign speedup = 1 / [0.9 + 0.1 / 10] = 1.10
• What if both the adder and the multiplier are redesigned?
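One way to approach the closing question is to generalize the formula to several enhanced parts at once: the unaffected fraction keeps its weight, and each enhanced fraction is divided by its own speedup. The sketch below assumes the enhanced parts do not overlap. The function name is illustrative.

```python
# Amdahl's law with several independent enhancements.

def combined_speedup(enhancements):
    """enhancements: list of (fraction of time, speedup factor) pairs.

    Assumes the enhanced parts do not overlap, so their fractions sum to <= 1.
    """
    enhanced_fraction = sum(fraction for fraction, _ in enhancements)
    if enhanced_fraction > 1.0:
        raise ValueError("fractions must sum to at most 1")
    new_time = (1.0 - enhanced_fraction) + sum(
        fraction / factor for fraction, factor in enhancements
    )
    return 1.0 / new_time


adder_only = combined_speedup([(0.30, 2)])
multiplier_only = combined_speedup([(0.25, 3)])
divider_only = combined_speedup([(0.10, 10)])
adder_and_multiplier = combined_speedup([(0.30, 2), (0.25, 3)])
print(round(adder_and_multiplier, 2))
```

With both redesigns, the new time is 0.45 + 0.15 + 0.083 ≈ 0.683 of the original, which is a larger gain than either enhancement alone.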
Amdahl’s Law – limit to improvement
• Pitfall: improving an aspect of a computer and expecting a proportional improvement in overall performance

§1.8 Fallacies and Pitfalls

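• Example: multiply accounts for 80 s of a 100 s execution time
• How much improvement in multiply performance to get 5× overall?
• Can’t be done! A 5× speedup needs a total of 20 s, but the 20 s not spent on multiply already uses all of it.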
• Corollary: make the common case fast
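The multiply example can be checked numerically. The sketch solves for the factor n by which the affected part must speed up to reach a target overall speedup, and returns None when no finite n works. The function name and the None convention are illustrative choices.

```python
# Limit to improvement: how much faster must the affected part run?

def required_factor(total_s, affected_s, target_speedup):
    """Speedup needed on the affected part, or None if the target is unreachable."""
    target_total = total_s / target_speedup
    unaffected_s = total_s - affected_s
    budget = target_total - unaffected_s  # time left for the affected part
    if budget <= 0:
        return None
    return affected_s / budget


print(required_factor(100.0, 80.0, 4))  # 4x overall is reachable
print(required_factor(100.0, 80.0, 5))  # 5x overall is not
```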
Pitfall: MIPS as a Performance Metric
• MIPS: Millions of Instructions Per Second
• Doesn’t account for
• Differences in ISAs between computers
• Differences in complexity between instructions
• CPI varies between programs on a given CPU
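The last point can be illustrated with the standard definition MIPS = instruction count / (execution time x 10^6), which is equivalent to clock rate / (CPI x 10^6). The two CPI values below are hypothetical, chosen only to show that the same CPU reports different MIPS ratings on different programs.

```python
# MIPS = clock rate / (CPI x 10^6); the rating changes with the program's CPI.

def mips(clock_hz, cpi):
    """Millions of instructions per second for a given clock rate and average CPI."""
    return clock_hz / (cpi * 1e6)


# Same 2 GHz CPU, two hypothetical programs with different average CPI
program_x = mips(2e9, 1.0)
program_y = mips(2e9, 2.0)
print(program_x, program_y)
```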
Concluding Remarks
• Cost/performance is improving
• Due to underlying technology development
• Hierarchical layers of abstraction
• In both hardware and software
• Instruction set architecture
• The hardware/software interface
• Execution time: the best performance measure
• Power is a limiting factor
• Use parallelism to improve performance

§1.9 Concluding Remarks