Coe 308
COE 308

Term - 051

Dr Abdelhafid Bouhraoua

Performance


Need for Performance

Goal: To Have Some Predictability Over Computer Usage

Consequence: To Be Able To Adequately Choose The Right Computer For A Given Application


Examples where Performance is needed

  • High Accessibility

    • Data-Base Server
    • Web Server
    • Banking System

  • High Speed

    • Astronomy
    • Genetic Research
    • Weather Prediction

  • Low Cost

    • POS Terminal
    • Portable Device
    • Cell Phone
    • Embedded Apps (Appliances, Toys, …)


Defining Performance

  • Speed ?

  • Accessibility ?

  • Cost ?

Only Speed Is Considered in This Context


What Speed ?

Which Plane Has Higher Performance ?

  • Time to do the task (Execution Time)

    – execution time, response time, latency

  • Tasks per day, hour, week, sec, ns, ... (Performance)

    – throughput, bandwidth

    Response time and throughput are often in opposition


Definitions

  • Performance is in units of things-per-second

    • bigger is better

  • If we are primarily concerned with response time:

Performance(x) = 1/Execution_time(x)

"X is n times faster than Y" means:

n = Performance(X) / Performance(Y)


Throughput and Response Time

  • Flying time of Concorde vs. Boeing 747?

    • Concorde is 1350 mph / 610 mph = 2.2 times faster (= 6.5 hours / 3 hours)

  • Throughput of Concorde vs. Boeing 747? (pmph = passenger-miles per hour)

    • Concorde is 178,200 pmph / 286,700 pmph = 0.62 “times faster”

    • Boeing is 286,700 pmph / 178,200 pmph = 1.6 “times faster”

  • Boeing is 1.6 times (“60%”) faster in terms of throughput

  • Concorde is 2.2 times (“120%”) faster in terms of flying time

    We will focus primarily on execution time for a single job
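
The airplane comparison can be reproduced in a few lines of Python. This is a minimal sketch that uses only the speed and passenger-miles-per-hour figures quoted on the slide; the function name `times_faster` is an illustrative choice, not something from the original deck.

```python
def times_faster(perf_x: float, perf_y: float) -> float:
    """Return n such that X is n times faster than Y (ratio of performances)."""
    return perf_x / perf_y


# Figures quoted on the slide.
concorde_mph, boeing_mph = 1350, 610
concorde_pmph, boeing_pmph = 178_200, 286_700

# Response-time view: the faster plane finishes one trip sooner.
speed_ratio = times_faster(concorde_mph, boeing_mph)

# Throughput view: the bigger plane moves more passenger-miles per hour.
throughput_ratio = times_faster(boeing_pmph, concorde_pmph)

print(f"Concorde vs. Boeing, flying speed: {speed_ratio:.2f}x")
print(f"Boeing vs. Concorde, throughput:   {throughput_ratio:.2f}x")
```

The same two machines rank in opposite order depending on which metric is chosen, which is why the metric has to be fixed before comparing.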


Relative Performance

Computer A is n Times Faster Than Computer B if:

Performance A / Performance B = n

Or

Execution Time B / Execution Time A = n
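
As a small illustration of the two equivalent ratios, the sketch below computes n from execution times. The timings are hypothetical values chosen only for the example, and `relative_performance` is an illustrative name.

```python
def relative_performance(exec_time_a: float, exec_time_b: float) -> float:
    """Return n such that computer A is n times faster than computer B.

    Performance is 1 / execution time, so
    Performance(A) / Performance(B) equals exec_time_b / exec_time_a.
    """
    if exec_time_a <= 0 or exec_time_b <= 0:
        raise ValueError("execution times must be positive")
    return exec_time_b / exec_time_a


# Hypothetical timings in seconds (not from the slides).
time_a, time_b = 10.0, 15.0

n = relative_performance(time_a, time_b)
n_from_performance = (1 / time_a) / (1 / time_b)

print(f"A is {n:.2f} times faster than B")
```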


Metrics and their Relation

Most Basic Metrics: Clock Cycles, Clock Cycle Time, CPU Time, # of Instructions per program

CPU Time = CPU Clk Cycles/Program * Clk Cycle Time

CPU Time = (CPU Clk Cycles/Program) / Clock Rate (Frequency)

CPU Cycles/Program = Instr./Program x Average Cycles/Inst.
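
The three relations on this slide chain together as sketched below. The instruction count, average CPI and clock rate are hypothetical numbers picked for illustration, and the function names are not from the original slides.

```python
def cpu_clock_cycles(instruction_count: int, avg_cpi: float) -> float:
    """CPU cycles per program = instructions per program x average cycles per instruction."""
    return instruction_count * avg_cpi


def cpu_time_seconds(clock_cycles: float, clock_rate_hz: float) -> float:
    """CPU time = clock cycles / clock rate."""
    return clock_cycles / clock_rate_hz


# Hypothetical program and machine (assumed values).
instruction_count = 2_000_000
avg_cpi = 2.2
clock_rate_hz = 500e6  # 500 MHz

cycles = cpu_clock_cycles(instruction_count, avg_cpi)
seconds = cpu_time_seconds(cycles, clock_rate_hz)

# Equivalent form: clock cycles x clock cycle time.
clock_cycle_time = 1 / clock_rate_hz
seconds_alt = cycles * clock_cycle_time

print(f"{cycles:.0f} cycles, {seconds * 1e3:.2f} ms")
```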


CPI (Cycles Per Instruction)

Average Cycles Per Instruction

CPI = (CPU Time / Clock Cycle Time) / Instruction Count

    = Clock Cycles / Instruction Count

CPU time = Clock Cycle Time * Σ (i = 1..n) CPIi * Ii

n: number of instructions in the Instruction Set

CPIi: number of clock cycles Instruction i takes to execute

Ii: Count of instructions of type i in the program

Divide CPU time by Clock Cycle Time and Instruction Count to get the CPI:

CPI = Σ (i = 1..n) CPIi * Fi

Fi: Frequency of Instructions

Fi = Ii / Instruction Count

Invest Resource Where Time Is Spent
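
A short sketch can confirm that the two routes to the average CPI agree: total cycles divided by instruction count, and the frequency-weighted sum of per-class CPIs. The two instruction classes and their counts are hypothetical, and the function names are illustrative.

```python
def average_cpi_from_counts(cpi_by_class: dict, count_by_class: dict) -> float:
    """CPI = (sum of CPIi * Ii) / Instruction Count."""
    total_cycles = sum(cpi_by_class[op] * count_by_class[op] for op in count_by_class)
    total_instructions = sum(count_by_class.values())
    return total_cycles / total_instructions


def average_cpi_from_frequencies(cpi_by_class: dict, freq_by_class: dict) -> float:
    """CPI = sum of CPIi * Fi, where Fi = Ii / Instruction Count."""
    return sum(cpi_by_class[op] * freq_by_class[op] for op in freq_by_class)


# Hypothetical program with two instruction classes (assumed values).
cpi_by_class = {"fast": 1, "slow": 4}
count_by_class = {"fast": 750, "slow": 250}

instruction_count = sum(count_by_class.values())
freq_by_class = {op: count / instruction_count for op, count in count_by_class.items()}

cpi_counts = average_cpi_from_counts(cpi_by_class, count_by_class)
cpi_freq = average_cpi_from_frequencies(cpi_by_class, freq_by_class)

print(cpi_counts, cpi_freq)
```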


Metrics and their Relation - Revisited -

CPU TIME = Seconds / Program

CPU TIME = (Instructions / Program) x (Cycles / Instruction) x (Seconds / Cycle)

  • Instructions / Program: Implementation / Compiler Optimization Dependent

  • Cycles / Instruction (CPI): Variable

  • Seconds / Cycle (Clock Cycle): Fixed


Example

  • Example (RISC processor)

  • Typical Mix

  • Base Machine (Reg / Reg)

      Op       Freq   CPI(i)   CPI(i) x Freq
      ALU      50%    1        .5
      Load     20%    5        1.0
      Store    10%    3        .3
      Branch   20%    2        .4

  1. How much faster would the machine be if a better data cache reduced the average load time to 2 cycles?

  2. How does this compare with using branch prediction to shave a cycle off the branch time?

  3. What if two ALU instructions could be executed at once?


  • Answering 1.

    • Computing the CPI Before Improvement:

      Op       Freq   CPI(i)   CPI(i) x Freq
      ALU      50%    1        .5
      Load     20%    5        1.0
      Store    10%    3        .3
      Branch   20%    2        .4

      CPI1 = .5x1 + .2x5 + .1x3 + .2x2 = 2.2

    • Computing the CPI After Improvement:

      Op       Freq   CPI(i)   CPI(i) x Freq
      ALU      50%    1        .5
      Load     20%    2        .4
      Store    10%    3        .3
      Branch   20%    2        .4

      CPI2 = .5x1 + .2x2 + .1x3 + .2x2 = 1.6


    Answering 1. (cont.)

    • How much faster would the machine be if a better data cache reduced the average load time to 2 cycles?

    • Answer: It is n times faster with:

      n = CPU Time Before Imp. / CPU Time After Imp.

        = (Clock Cycle Time * CPI1 * Inst. Count) / (Clock Cycle Time * CPI2 * Inst. Count)

        = CPI1 / CPI2 = 2.2 / 1.6 = 1.375

    • We Say:

      • CPU is 1.375 times faster, or

      • CPU is 37.5% faster
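
The cancellation of clock cycle time and instruction count in the ratio can be checked numerically. The sketch below uses the instruction mix from the example; the instruction count and clock cycle time are hypothetical placeholders, since any positive values cancel out.

```python
# Instruction mix from the example: op -> (frequency, cycles per instruction).
BASE_MIX = {"ALU": (0.50, 1), "Load": (0.20, 5), "Store": (0.10, 3), "Branch": (0.20, 2)}


def average_cpi(mix: dict) -> float:
    """Frequency-weighted average cycles per instruction."""
    return sum(freq * cpi for freq, cpi in mix.values())


def cpu_time(mix: dict, instruction_count: int, clock_cycle_time: float) -> float:
    """CPU time = CPI x instruction count x clock cycle time."""
    return average_cpi(mix) * instruction_count * clock_cycle_time


# Better data cache: average load time drops from 5 to 2 cycles.
improved_mix = dict(BASE_MIX, Load=(0.20, 2))

# Hypothetical values (assumed); they cancel in the ratio.
instruction_count = 1_000_000
clock_cycle_time = 2e-9

speedup = (cpu_time(BASE_MIX, instruction_count, clock_cycle_time)
           / cpu_time(improved_mix, instruction_count, clock_cycle_time))

print(f"CPI before: {average_cpi(BASE_MIX):.2f}")
print(f"CPI after:  {average_cpi(improved_mix):.2f}")
print(f"Speedup:    {speedup:.3f}")
```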


    Answering 2.

    How does this compare with using branch prediction to shave a cycle off the branch time?

    Answer:

    “Shaving” a cycle off the branch time means the CPI of branch instructions is reduced by one cycle

    • Computing the CPI After Improvement:

      Op       Freq   CPI(i)   CPI(i) x Freq
      ALU      50%    1        .5
      Load     20%    5        1.0
      Store    10%    3        .3
      Branch   20%    1        .2

      CPI2 = .5x1 + .2x5 + .1x3 + .2x1 = 2.0

    • Speedup = CPI1 / CPI2 = 2.2 / 2.0 = 1.1, compared with 1.375 for the faster load

    • Reducing the Load time produces better performance than reducing the branch time


    Answering 3.

    What if two ALU instructions could be executed at once?

    Answer:

    Two ALU instructions executed at once means that, on the new machine B, one ALU instruction takes virtually half the time it takes on the base machine A. So, for ALU instructions,

    CPI(i)B = CPI(i)A/2

    • Computing the CPI of Machine B

      Op       Freq   CPI(i)   CPI(i) x Freq
      ALU      50%    .5       .25
      Load     20%    5        1.0
      Store    10%    3        .3
      Branch   20%    2        .4

      CPIB = .5x.5 + .2x5 + .1x3 + .2x2 = 1.95
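
The three candidate enhancements can be ranked side by side. This is a minimal sketch built on the example's instruction mix; the helper `with_cpi` and the scenario labels are illustrative names, not part of the original slides.

```python
# Instruction mix from the example: op -> (frequency, cycles per instruction).
BASE_MIX = {"ALU": (0.50, 1), "Load": (0.20, 5), "Store": (0.10, 3), "Branch": (0.20, 2)}


def average_cpi(mix: dict) -> float:
    """Frequency-weighted average cycles per instruction."""
    return sum(freq * cpi for freq, cpi in mix.values())


def with_cpi(mix: dict, op: str, new_cpi: float) -> dict:
    """Return a copy of the mix where one instruction class has a new CPI."""
    freq, _old_cpi = mix[op]
    return {**mix, op: (freq, new_cpi)}


scenarios = {
    "better data cache (Load 5 -> 2)": with_cpi(BASE_MIX, "Load", 2),
    "branch prediction (Branch 2 -> 1)": with_cpi(BASE_MIX, "Branch", 1),
    "two ALU ops at once (ALU 1 -> 0.5)": with_cpi(BASE_MIX, "ALU", 0.5),
}

base_cpi = average_cpi(BASE_MIX)

# Same clock and instruction count in every scenario, so speedup = CPI ratio.
speedups = {name: base_cpi / average_cpi(mix) for name, mix in scenarios.items()}

for name, speedup in sorted(speedups.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: CPI {average_cpi(scenarios[name]):.2f}, speedup {speedup:.3f}")

best = max(speedups, key=speedups.get)
```

Under these numbers the load improvement wins, because loads account for the largest share of cycles even though they are not the most frequent instruction.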


    Time % Evaluation

    How to determine which class of instructions takes the most time?

    • Evaluate Time Percentages of Instructions

    • Cannot be Directly Measured (Program has Mixed Instructions)

    • Need to be Computed Using CPI and Frequency


    Time % Evaluation

    • Given:

    • Ic: Instruction Count

    • Ii: Instruction Count for Instruction Class i

    • Fi: Frequency of Instructions of Class i

    • Tc: Clock Cycle Time

    • CPIi: Clock Cycles/Instruction for Class i

    • CPI: Average Clock Cycles / Instruction for the whole program

    • Pi: Percentage of time for instruction of Class i

    CPUtime = CPI x Ic x Tc

    CPUtimei = CPIi x Ii x Tc

    Ii = Ic x Fi

    CPUtimei = CPIi x Ic x Fi x Tc

    Pi = CPUtimei / CPUtime

    Pi = (CPIi x Ic x Fi x Tc) / (CPI x Ic x Tc)

    Pi = (CPIi x Fi) / CPI
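
Applying Pi = (CPIi x Fi) / CPI to the RISC example mix shows where the time goes. The sketch below is illustrative; `time_fractions` is a name chosen for this example.

```python
# Instruction mix from the RISC example: op -> (Fi, CPIi).
MIX = {"ALU": (0.50, 1), "Load": (0.20, 5), "Store": (0.10, 3), "Branch": (0.20, 2)}


def time_fractions(mix: dict) -> dict:
    """Return Pi = (CPIi x Fi) / CPI for every instruction class."""
    cpi = sum(freq * cpi_i for freq, cpi_i in mix.values())
    return {op: (freq * cpi_i) / cpi for op, (freq, cpi_i) in mix.items()}


fractions = time_fractions(MIX)

for op, fraction in sorted(fractions.items(), key=lambda item: item[1], reverse=True):
    print(f"{op:<7}{fraction:6.1%}")

dominant = max(fractions, key=fractions.get)
```

Loads are only 20% of the instructions but take the largest share of the execution time, which matches the earlier result that speeding up loads pays off most.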


    Amdahl’s Law

    Speed-up due to Enhancement E

    Speedup = (Execution Time w/o E) / (Execution Time w/ E) = (Performance w/ E) / (Performance w/o E)

    Suppose that Enhancement E accelerates only a portion F of the execution time, by a factor S

    [Figure: execution-time bars labeled TA, TFA, TE and TFE]


    Amdahl’s Law

    New Enhancement touched only a fraction F of the whole execution time TA and reduced this fraction by a factor S while keeping the remainder part of TA unchanged

    TE = TA – TFA + TFE          (TA – TFA is unchanged)

    TFA = TA * F                 (F is a fraction of TA)

    TFE = TFA/S = TA * F/S       (Time is reduced by a factor S)

    TE = TA – TA*F + TA * F/S

    Means:

    TE = TA * (1 – F + (F/S))

    Speedup = TA / TE = 1 / (1 – F + (F/S))
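
Amdahl's Law can be cross-checked against the RISC example from earlier in the deck. In the sketch below, F is taken as the share of execution time spent on loads (1.0 / 2.2) and S as the load speedup (5 cycles down to 2); tying the two slides together this way is an illustration added here, and `amdahl_speedup` is an illustrative name.

```python
def amdahl_speedup(fraction: float, factor: float) -> float:
    """Speedup = 1 / (1 - F + F / S) for a fraction F accelerated by a factor S."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction F must be between 0 and 1")
    if factor <= 0:
        raise ValueError("factor S must be positive")
    return 1.0 / ((1.0 - fraction) + fraction / factor)


# Loads contribute 1.0 of the 2.2 average cycles per instruction.
load_time_fraction = 1.0 / 2.2
# Load time drops from 5 cycles to 2 cycles.
load_factor = 5 / 2

speedup = amdahl_speedup(load_time_fraction, load_factor)

# Upper bound: even an infinitely fast load leaves the other 1 - F untouched.
upper_bound = amdahl_speedup(load_time_fraction, float("inf"))

print(f"Speedup with the better cache: {speedup:.3f}")
print(f"Upper bound for any load enhancement: {upper_bound:.3f}")
```

The result agrees with the CPI-ratio calculation (2.2 / 1.6), and the upper bound shows why the untouched part of the execution time limits the gain.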


    Benchmarks

    • Few users run the same program over and over

    • Need Programs specially developed to compare performance

    • Best Reference: Real Application

    • Real Application NOT common to all users

    Benchmarks are Programs developed for the sole purpose of Performance Evaluation


    Typical Workload


    Full Application Benchmark


    Small Benchmarks


    SPEC95

    • Eighteen application benchmarks (with inputs) reflecting a technical computing workload

    • Eight integer

      • go, m88ksim, gcc, compress, li, ijpeg, perl, vortex

    • Ten floating-point intensive

      • tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fpppp, wave5

    • Must run with standard compiler flags

      • eliminate special undocumented incantations that may not even generate working code for real programs


    Fallacies and Pitfalls

    • Amdahl’s law sets a limit: the achievable speedup is NOT unlimited

      • Improving one aspect cannot improve the overall performance by a factor proportional to the size of the improvement

    • Hardware-independent metrics DO NOT predict performance

      • Code size, Impl. of software systems

    • Using MIPS (Millions of Inst. Per Second) as a performance metric

      • Instructions have different CPI

      • The MIPS metric varies from one program to another on the SAME CPU.
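
The MIPS pitfall can be illustrated with two instruction mixes on one CPU. The sketch assumes the usual definition MIPS = instruction count / (execution time x 10^6), which reduces to clock rate / (CPI x 10^6); the slides name the metric but do not spell the formula out. The clock rate and the second program's mix are hypothetical.

```python
def average_cpi(mix: dict) -> float:
    """Frequency-weighted average cycles per instruction."""
    return sum(freq * cpi for freq, cpi in mix.values())


def mips(clock_rate_hz: float, cpi: float) -> float:
    """MIPS = clock rate / (CPI x 10^6), assuming the usual definition."""
    return clock_rate_hz / (cpi * 1e6)


CLOCK_RATE_HZ = 500e6  # hypothetical 500 MHz CPU (assumed value)

# Program A uses the mix from the RISC example: op -> (frequency, CPI).
program_a = {"ALU": (0.50, 1), "Load": (0.20, 5), "Store": (0.10, 3), "Branch": (0.20, 2)}
# Program B is a hypothetical ALU-heavy mix on the same CPU (same per-class CPI).
program_b = {"ALU": (0.80, 1), "Load": (0.10, 5), "Store": (0.05, 3), "Branch": (0.05, 2)}

mips_a = mips(CLOCK_RATE_HZ, average_cpi(program_a))
mips_b = mips(CLOCK_RATE_HZ, average_cpi(program_b))

print(f"Program A: {mips_a:.1f} MIPS")
print(f"Program B: {mips_b:.1f} MIPS")
```

The hardware is identical in both runs, yet the MIPS rating changes with the instruction mix, so a single MIPS figure says little about how a different program will perform.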

