Lec 3 Sept 2

  • complete Chapter 1
  • exercises from Chapter 1
    • quiz # 1
  • Chapter 2 start
Performance Summary
  • Performance depends on
    • Algorithm: affects IC, possibly CPI
    • Programming language: affects IC, CPI
    • Compiler: affects IC, CPI
    • Instruction set architecture: affects IC, CPI, Tc

The BIG Picture
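The IC, CPI and Tc factors listed above combine through the standard CPU-time relation, CPU time = IC × CPI × Tc. The relation itself is not in the transcribed slide text, and the function name and sample numbers in this minimal sketch are illustrative assumptions only.

```python
def cpu_time(instruction_count, cpi, clock_cycle_time):
    """CPU time in seconds = IC x CPI x Tc."""
    return instruction_count * cpi * clock_cycle_time

# Hypothetical program: 1e9 instructions, CPI of 2.0, 0.5 ns cycle (2 GHz).
baseline = cpu_time(1e9, 2.0, 0.5e-9)

# A better compiler might lower both IC and CPI (made-up values).
optimized = cpu_time(0.8e9, 1.5, 0.5e-9)

print(f"baseline:  {baseline:.2f} s")
print(f"optimized: {optimized:.2f} s")
```

Changing the algorithm, language or compiler moves IC and CPI; only a change of instruction set architecture or hardware also moves Tc.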


Exercise 1.2.1

For a color display using 8 bits for each primary color (R, G, B) per pixel and with a resolution of 1280 x 800 pixels, what should be the size (in bytes) of the frame buffer to store a frame?

Each frame requires $1280 \times 800 \times 3 = 3{,}072{,}000$ bytes, or about 3 MB.

If a computer has 3 GB memory to store such frames, how many frames can be stored?

$3 \times 10^9 / (3 \times 10^6) \approx 1000$ frames
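The same arithmetic can be checked without rounding. This is a small sketch that assumes 1 GB = 10^9 bytes, as the slide does; the variable names are illustrative.

```python
WIDTH, HEIGHT = 1280, 800
BYTES_PER_PIXEL = 3  # 8 bits each for R, G and B

frame_bytes = WIDTH * HEIGHT * BYTES_PER_PIXEL
memory_bytes = 3 * 10**9  # assumption: 1 GB = 10^9 bytes

frames = memory_bytes // frame_bytes  # whole frames that fit
print(frame_bytes)  # 3072000
print(frames)       # 976
```

The exact count is 976 frames; the slide's figure of about 1000 comes from rounding the frame size down to 3 MB.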


Exercise 1.3

Consider three processors P1, P2 and P3 with the same instruction set and the clock rates and CPI given below:

Processor   Clock rate   CPI
P1          2 GHz        1.5
P2          1.5 GHz      1.0
P3          3 GHz        2.5

1.3.1. Which processor has the highest performance?

Suppose the program has N instructions.

Time taken to execute on P1 = $1.5N / (2 \times 10^9) = 0.75N \times 10^{-9}$ s

Time taken to execute on P2 = $N / (1.5 \times 10^9) = 0.67N \times 10^{-9}$ s

Time taken to execute on P3 = $2.5N / (3 \times 10^9) = 0.83N \times 10^{-9}$ s

P2 has the best performance (since it takes the least time to execute).
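Because all three processors run the same N instructions, comparing the time per instruction (CPI divided by clock rate) is enough. A minimal sketch using the table values; the function and variable names are illustrative.

```python
# (clock rate in Hz, CPI) taken from the Exercise 1.3 table
processors = {
    "P1": (2.0e9, 1.5),
    "P2": (1.5e9, 1.0),
    "P3": (3.0e9, 2.5),
}

def time_per_instruction(clock_rate_hz, cpi):
    """Average seconds per instruction = CPI / clock rate."""
    return cpi / clock_rate_hz

times = {name: time_per_instruction(rate, cpi)
         for name, (rate, cpi) in processors.items()}
fastest = min(times, key=times.get)

for name, t in times.items():
    print(f"{name}: {t * 1e9:.3f} ns per instruction")
print("fastest:", fastest)
```

Note that P3 has the highest clock rate but the lowest performance, because its CPI more than offsets the faster clock.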

1.3.2. If the processors each execute a program in 10 seconds, find the number of cycles and the number of instructions.

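Time taken to execute on P1 = $1.5 N_1 / (2 \times 10^9) = 0.75 N_1 \times 10^{-9} = 10$ s

So $N_1 = 1.33 \times 10^{10}$ instructions. The number of cycles is clock rate × time = $2 \times 10^9 \times 10 = 2 \times 10^{10}$.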
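The transcript works out only P1. Under the same reasoning (cycles = clock rate × time, instructions = cycles / CPI), the remaining processors can be computed as in the sketch below; names are illustrative.

```python
EXEC_TIME_S = 10.0

# (clock rate in Hz, CPI) taken from the Exercise 1.3 table
processors = {"P1": (2.0e9, 1.5), "P2": (1.5e9, 1.0), "P3": (3.0e9, 2.5)}

def cycles_and_instructions(clock_rate_hz, cpi, exec_time_s):
    """Return (cycles, instructions) executed in exec_time_s seconds."""
    cycles = clock_rate_hz * exec_time_s
    instructions = cycles / cpi
    return cycles, instructions

results = {name: cycles_and_instructions(rate, cpi, EXEC_TIME_S)
           for name, (rate, cpi) in processors.items()}

for name, (cycles, instructions) in results.items():
    print(f"{name}: {cycles:.2e} cycles, {instructions:.2e} instructions")
```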


Exercise 1.4.3

Given below are the numbers of instructions of each type in a program:

arith   store   load   branch   total
500     50      100    50       700

Assuming arith, store, load and branch instructions take 1, 5, 5 and 2 cycles respectively, what is the execution time on a 2 GHz processor?

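Solution: execution time = cycle time × CPI × number of instructions

Cycle time = $1 / (2 \times 10^9)$ s $= 0.5 \times 10^{-9}$ s

CPI = $(500 \times 1 + 50 \times 5 + 100 \times 5 + 50 \times 2) / 700 = 1350 / 700 \approx 1.93$

So the total time = $0.5 \times 10^{-9} \times 1.93 \times 700 = 675 \times 10^{-9}$ s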
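The weighted-CPI calculation can be sketched as follows. The instruction counts, cycle counts and clock rate are from the exercise; the variable names are illustrative.

```python
counts = {"arith": 500, "store": 50, "load": 100, "branch": 50}
cycles_per = {"arith": 1, "store": 5, "load": 5, "branch": 2}
CLOCK_RATE_HZ = 2e9

total_cycles = sum(counts[kind] * cycles_per[kind] for kind in counts)
total_instructions = sum(counts.values())

average_cpi = total_cycles / total_instructions
exec_time_s = total_cycles / CLOCK_RATE_HZ  # cycles x cycle time

print(total_cycles)                    # 1350
print(f"{average_cpi:.2f}")            # 1.93
print(f"{exec_time_s * 1e9:.0f} ns")   # 675 ns
```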


Exercise 1.6

Compilers have a profound impact on the performance of an application on a given processor. This problem explores the impact compilers have on execution time.

       Compiler A                          Compiler B
       No. of instructions   Exec. time    No. of instructions   Exec. time
(a)    $1.0 \times 10^9$     1 s           $1.2 \times 10^9$     1.4 s
(b)    $1.4 \times 10^9$     0.8 s         $1.2 \times 10^9$     0.7 s

Find the average CPI for each program given that the processor has a cycle time of 1 ns.

Exec. time = CPI × cycle time × no. of instructions

(a) Compiler A: CPI = $1 / (10^{-9} \times 10^9) = 1$
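The transcript shows only case (a) for compiler A. Rearranging the same equation to CPI = exec. time / (cycle time × no. of instructions) gives the other three values, as in this sketch. The dictionary layout and names are illustrative, and the numbers assume the table is read as one row per case with compiler A's columns first.

```python
CYCLE_TIME_S = 1e-9  # 1 ns

# (instruction count, execution time in seconds) keyed by (case, compiler)
measurements = {
    ("a", "A"): (1.0e9, 1.0),
    ("a", "B"): (1.2e9, 1.4),
    ("b", "A"): (1.4e9, 0.8),
    ("b", "B"): (1.2e9, 0.7),
}

def average_cpi(instruction_count, exec_time_s, cycle_time_s=CYCLE_TIME_S):
    """CPI = execution time / (instruction count x cycle time)."""
    return exec_time_s / (instruction_count * cycle_time_s)

cpis = {key: average_cpi(ic, t) for key, (ic, t) in measurements.items()}

for (case, compiler), cpi in cpis.items():
    print(f"({case}) compiler {compiler}: CPI = {cpi:.2f}")
```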

Power Trends

§1.5 The Power Wall

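  • In CMOS IC technology: $\text{Power} = \text{Capacitive load} \times \text{Voltage}^2 \times \text{Frequency}$
    • Power: ×30
    • Voltage: 5 V → 1 V
    • Frequency: ×1000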

Reducing Power
  • Suppose a new CPU has
    • 85% of capacitive load of old CPU
    • 15% voltage and 15% frequency reduction
  • The power wall
    • We can’t reduce voltage further
    • We can’t remove more heat
  • How else can we improve performance?
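The effect of the new CPU's changes can be estimated with the dynamic power relation, power = capacitive load × voltage² × frequency. The worked result is not in the transcribed text, so the sketch below is a reconstruction under that relation; the constant names are illustrative.

```python
CAPACITANCE_FACTOR = 0.85  # new load is 85% of the old load
VOLTAGE_FACTOR = 0.85      # 15% voltage reduction
FREQUENCY_FACTOR = 0.85    # 15% frequency reduction

# P_new / P_old: voltage enters squared, the other factors linearly.
power_ratio = CAPACITANCE_FACTOR * VOLTAGE_FACTOR**2 * FREQUENCY_FACTOR

print(f"{power_ratio:.2f}")  # 0.52
```

Under these assumptions the new CPU draws roughly half the power of the old one, with most of the saving coming from the squared voltage term.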
Exercise 1.7

1.7.4. Given the following information about each processor, calculate its capacitive load:

Processor 80286: clock rate = 12.5 MHz

power = 3.3 W

voltage = 5 V

Solution: Use the equation

$\text{power} = \text{capacitive load} \times \text{voltage}^2 \times \text{clock rate}$

Capacitive load = $3.3 / (5^2 \times 12.5 \times 10^6) = 0.01056 \times 10^{-6}$ F
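The rearranged equation can be wrapped in a small helper so that the same calculation applies to other processors. This is a sketch using the equation exactly as stated in the solution; the function name is illustrative.

```python
def capacitive_load(power_w, voltage_v, clock_rate_hz):
    """Capacitive load in farads, from power = C x V^2 x f."""
    return power_w / (voltage_v**2 * clock_rate_hz)

# 80286: 3.3 W, 5 V, 12.5 MHz
c_80286 = capacitive_load(3.3, 5.0, 12.5e6)
print(f"{c_80286:.3e} F")  # 1.056e-08 F
```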

Uniprocessor Performance

§1.6 The Sea Change: The Switch to Multiprocessors

Constrained by power, instruction-level parallelism, memory latency


Multiprocessors

General-purpose uni-cores have reached limits of historic performance scaling

  • Power consumption
  • Wire delays
  • DRAM access latency
  • Diminishing returns of more instruction-level parallelism

Slide from Prof. Saman Amarasinghe

Multiprocessors
  • Multicore microprocessors
    • More than one processor per chip
  • Requires explicitly parallel programming
    • Compare with instruction level parallelism
      • Hardware executes multiple instructions at once
      • Hidden from the programmer
    • Hard to do
      • Programming for performance
      • Load balancing
      • Optimizing communication and synchronization
Manufacturing ICs
  • Yield: proportion of working dies per wafer

§1.7 Real Stuff: The AMD Opteron X4

Integrated Circuit Cost
  • Nonlinear relation to area and defect rate
    • Wafer cost and area are fixed
    • Defect rate determined by manufacturing process
    • Die area determined by architecture and circuit design
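The cost formulas themselves did not survive in the transcript. The sketch below assumes the commonly used textbook model (cost per die = wafer cost / (dies per wafer × yield), dies per wafer ≈ wafer area / die area, yield = 1 / (1 + defects per area × die area / 2)²). All function names and numbers are hypothetical and serve only to show the nonlinear relation to area.

```python
def dies_per_wafer(wafer_area, die_area):
    """Approximate die count; ignores dies lost at the wafer edge."""
    return wafer_area / die_area

def die_yield(defects_per_area, die_area):
    """Assumed yield model: 1 / (1 + defects_per_area * die_area / 2)^2."""
    return 1.0 / (1.0 + defects_per_area * die_area / 2.0) ** 2

def cost_per_die(wafer_cost, wafer_area, die_area, defects_per_area):
    good_dies = dies_per_wafer(wafer_area, die_area) * die_yield(defects_per_area, die_area)
    return wafer_cost / good_dies

# Hypothetical values: $5000 wafer, 70000 mm^2 wafer, 0.001 defects per mm^2.
small = cost_per_die(5000.0, 70000.0, 100.0, 0.001)
large = cost_per_die(5000.0, 70000.0, 200.0, 0.001)

print(f"100 mm^2 die: ${small:.2f}")
print(f"200 mm^2 die: ${large:.2f}")
```

With these made-up inputs, doubling the die area more than doubles the cost per die, because fewer dies fit on the wafer and a smaller share of them work.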
SPEC CPU Benchmark
  • Programs used to measure performance
    • Supposedly typical of actual workload
  • Standard Performance Evaluation Corp (SPEC)
    • Develops benchmarks for CPU, I/O, Web, …
  • SPEC CPU2006
    • Elapsed time to execute a selection of programs
      • Negligible I/O, so focuses on CPU performance
    • Normalize relative to reference machine
    • Summarize as geometric mean of performance ratios
      • CINT2006 (integer) and CFP2006 (floating-point)
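The normalize-then-summarize step can be sketched as below. The timings are made up for illustration and are not real SPEC results; the function names are assumptions.

```python
import math

def spec_ratio(reference_time_s, measured_time_s):
    """Performance ratio relative to the reference machine."""
    return reference_time_s / measured_time_s

def geometric_mean(values):
    """n-th root of the product, computed via logs for numerical stability."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Hypothetical (reference time, measured time) pairs in seconds.
runs = [(1000.0, 100.0), (2000.0, 500.0), (900.0, 90.0)]

ratios = [spec_ratio(ref, measured) for ref, measured in runs]
score = geometric_mean(ratios)

print(ratios)
print(f"{score:.2f}")
```

A geometric mean is used because the ranking of two machines then does not depend on which machine is chosen as the reference.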
CINT2006 for Opteron X4 2356

High cache miss rates

Amdahl’s Law

$$s = \frac{1}{f + (1 - f)/p} \le \min(p,\, 1/f)$$

where f = fraction unaffected and p = speedup of the rest.

Amdahl’s law: speedup achieved if a fraction f of a task is unaffected and the remaining 1 – f part runs p times as fast.

Amdahl’s Law in design

Example

A processor spends 30% of its time on flp addition, 25% on flp mult, and 10% on flp division. Evaluate the following enhancements, each costing the same to implement:

  • Redesign of the flp adder to make it twice as fast.
  • Redesign of the flp multiplier to make it three times as fast.
  • Redesign of the flp divider to make it 10 times as fast.
Solution

  • Adder redesign speedup = 1 / [0.7 + 0.3 / 2] = 1.18
  • Multiplier redesign speedup = 1 / [0.75 + 0.25 / 3] = 1.20
  • Divider redesign speedup = 1 / [0.9 + 0.1 / 10] = 1.10
  • What if both the adder and the multiplier are redesigned?
Amdahl’s Law – limit to improvement
  • Pitfall: improving an aspect of a computer and expecting a proportional improvement in overall performance

§1.8 Fallacies and Pitfalls

  • Example: multiply accounts for 80 s of a 100 s execution time
    • How much improvement in multiply performance to get 5× overall?
  • Can’t be done!
  • Corollary: make the common case fast
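Why the 5× target cannot be reached follows from the relation improved time = affected time / improvement factor + unaffected time. The relation is not in the transcribed text, and the function below is an illustrative sketch that solves it for the improvement factor.

```python
def required_improvement(t_affected, t_total, target_speedup):
    """Factor by which the affected part must speed up, or None if impossible."""
    t_unaffected = t_total - t_affected
    target_time = t_total / target_speedup
    remaining = target_time - t_unaffected  # time budget left for the affected part
    if remaining <= 0:
        return None  # the unaffected part alone already uses the whole budget
    return t_affected / remaining

print(required_improvement(80, 100, 4))  # 16.0
print(required_improvement(80, 100, 5))  # None
```

A 5× overall speedup needs a total time of 20 s, but the 20 s outside multiply is untouched, so multiply would have to take zero time.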
Pitfall: MIPS as a Performance Metric
  • MIPS: Millions of Instructions Per Second
    • Doesn’t account for
      • Differences in ISAs between computers
      • Differences in complexity between instructions
  • CPI varies between programs on a given CPU
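The pitfall can be illustrated with case (b) of Exercise 1.6, assuming the usual definition MIPS = instruction count / (execution time × 10^6), which is not spelled out in the transcript. The names below are illustrative.

```python
def mips(instruction_count, exec_time_s):
    """Millions of instructions per second."""
    return instruction_count / (exec_time_s * 1e6)

# Exercise 1.6, case (b): the same program built with two compilers.
compiler_a = {"instructions": 1.4e9, "time_s": 0.8}
compiler_b = {"instructions": 1.2e9, "time_s": 0.7}

mips_a = mips(compiler_a["instructions"], compiler_a["time_s"])
mips_b = mips(compiler_b["instructions"], compiler_b["time_s"])

print(f"compiler A: {mips_a:.0f} MIPS, {compiler_a['time_s']} s")
print(f"compiler B: {mips_b:.0f} MIPS, {compiler_b['time_s']} s")
```

Compiler A's code has the higher MIPS rating (about 1750 against about 1714) yet takes longer to run, because it executes more instructions to do the same work.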
Concluding Remarks
  • Cost/performance is improving
    • Due to underlying technology development
  • Hierarchical layers of abstraction
    • In both hardware and software
  • Instruction set architecture
    • The hardware/software interface
  • Execution time: the best performance measure
  • Power is a limiting factor
    • Use parallelism to improve performance

§1.9 Concluding Remarks