Coe 308
COE 308. Term - 051. Dr Abdelhafid Bouhraoua. Performance. Need for Performance. Goal: to have some predictability over computer usage. Consequence: to be able to adequately choose the right computer for a given application.

COE 308

Term - 051

Dr Abdelhafid Bouhraoua

Performance


Need for Performance

Goal: To have some predictability over computer usage.

Consequence: To be able to adequately choose the right computer for a given application.


Examples where Performance is needed

  • High Accessibility
    • Database Server
    • Web Server
    • Banking System
  • High Speed
    • Astronomy
    • Genetic Research
    • Weather Prediction
  • Low Cost
    • POS Terminal
    • Portable Device
    • Cell Phone
    • Embedded Apps (Appliances, Toys, …)


Defining Performance

  • Speed ?
  • Accessibility ?
  • Cost ?

Only speed is considered in this context.


What Speed ?

Which plane has higher performance ?

  • Time to do the task (Execution Time)
    • execution time, response time, latency
  • Tasks per day, hour, week, sec, ns, ... (Performance)
    • throughput, bandwidth

Response time and throughput are often in opposition.


Definitions

  • Performance is in units of things-per-second
    • bigger is better
  • If we are primarily concerned with response time:

$$\text{Performance}(X) = \frac{1}{\text{Execution\_time}(X)}$$

  • "X is n times faster than Y" means:

$$n = \frac{\text{Performance}(X)}{\text{Performance}(Y)}$$


Throughput and Response Time

  • Time of Concorde vs. Boeing 747 ?
    • Concorde is 1350 mph / 610 mph = 2.2 times faster
      = 6.5 hours / 3 hours
  • Throughput of Concorde vs. Boeing 747 (in pmph, passenger-miles per hour) ?
    • Concorde is 178,200 pmph / 286,700 pmph = 0.62 “times faster”
    • Boeing is 286,700 pmph / 178,200 pmph = 1.6 “times faster”
  • Boeing is 1.6 times (“60%”) faster in terms of throughput
  • Concorde is 2.2 times (“120%”) faster in terms of flying time

We will focus primarily on execution time for a single job.
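The two comparisons above can be reproduced with a few lines of Python. This is only a sketch: the speeds and pmph figures are the ones quoted on the slide, while the dictionary layout, the function names, and the passenger counts (derived here by dividing throughput by speed) are assumptions introduced for illustration.

```python
# Figures quoted on the slide. Passenger counts are NOT quoted there;
# they are derived below from throughput / speed.
CONCORDE = {"speed_mph": 1350, "throughput_pmph": 178_200}
BOEING_747 = {"speed_mph": 610, "throughput_pmph": 286_700}


def times_faster(perf_x, perf_y):
    """Return n such that X is n times faster than Y (bigger is better)."""
    return perf_x / perf_y


def implied_passengers(plane):
    """Throughput (passenger-miles per hour) = passengers x speed (mph)."""
    return plane["throughput_pmph"] / plane["speed_mph"]


speed_ratio = times_faster(CONCORDE["speed_mph"], BOEING_747["speed_mph"])
throughput_ratio = times_faster(
    BOEING_747["throughput_pmph"], CONCORDE["throughput_pmph"]
)

print(f"Concorde, flying time: {speed_ratio:.1f} times faster")
print(f"Boeing 747, throughput: {throughput_ratio:.1f} times faster")
print(
    f"Implied passengers: Concorde {implied_passengers(CONCORDE):.0f}, "
    f"Boeing 747 {implied_passengers(BOEING_747):.0f}"
)
```

The same `times_faster` ratio gives opposite winners depending on which metric is plugged in, which is the point of the slide.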


Relative Performance

Computer A is n times faster than Computer B if:

$$\frac{\text{Performance}_A}{\text{Performance}_B} = n$$

or

$$\frac{\text{Execution Time}_B}{\text{Execution Time}_A} = n$$
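A minimal sketch showing that the two forms of the definition give the same n. The execution times (10 s and 15 s) are hypothetical values chosen for illustration, and the function names are not from the slides.

```python
def performance(execution_time_s):
    """Performance is the reciprocal of execution time."""
    return 1.0 / execution_time_s


def times_faster(exec_time_a_s, exec_time_b_s):
    """Return n such that computer A is n times faster than computer B."""
    return exec_time_b_s / exec_time_a_s


# Hypothetical measurements of the same program on two computers.
time_a_s = 10.0
time_b_s = 15.0

n_from_times = times_faster(time_a_s, time_b_s)
n_from_performance = performance(time_a_s) / performance(time_b_s)

print(f"From execution times: n = {n_from_times:.2f}")
print(f"From performance:     n = {n_from_performance:.2f}")
```

Note the inversion: performance of A goes in the numerator, but execution time of A goes in the denominator.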


Metrics and their Relation

Most basic metrics: Clock Cycles, Clock Cycle Time, CPU Time, # of Instructions per program

$$\text{CPU Time} = \frac{\text{CPU Clock Cycles}}{\text{Program}} \times \text{Clock Cycle Time}$$

$$\text{CPU Time} = \frac{\text{CPU Clock Cycles} / \text{Program}}{\text{Clock Rate (Frequency)}}$$

$$\frac{\text{CPU Clock Cycles}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \text{Average} \frac{\text{Cycles}}{\text{Instruction}}$$
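The relations above can be checked numerically. The sketch below assumes a hypothetical program of 10 million instructions with an average of 2.2 cycles per instruction on a 500 MHz clock; none of these numbers or function names come from the slides.

```python
def cpu_clock_cycles(instruction_count, avg_cycles_per_instruction):
    """Cycles per program = instructions per program x average cycles per instruction."""
    return instruction_count * avg_cycles_per_instruction


def cpu_time_from_cycle_time(clock_cycles, clock_cycle_time_s):
    """CPU time = clock cycles x clock cycle time."""
    return clock_cycles * clock_cycle_time_s


def cpu_time_from_clock_rate(clock_cycles, clock_rate_hz):
    """CPU time = clock cycles / clock rate."""
    return clock_cycles / clock_rate_hz


# Hypothetical workload and machine.
instruction_count = 10_000_000
avg_cpi = 2.2
clock_rate_hz = 500e6
clock_cycle_time_s = 1.0 / clock_rate_hz

cycles = cpu_clock_cycles(instruction_count, avg_cpi)
time_a = cpu_time_from_cycle_time(cycles, clock_cycle_time_s)
time_b = cpu_time_from_clock_rate(cycles, clock_rate_hz)

print(f"Clock cycles: {cycles:.0f}")
print(f"CPU time: {time_a * 1e3:.1f} ms (cycle time form), "
      f"{time_b * 1e3:.1f} ms (clock rate form)")
```

Both forms agree because the clock cycle time is simply the reciprocal of the clock rate.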


CPI (Cycles Per Instruction)

CPI is the average number of clock cycles per instruction:

$$\text{CPI} = \frac{\text{CPU Time} / \text{Clock Cycle Time}}{\text{Instruction Count}} = \frac{\text{Clock Cycles}}{\text{Instruction Count}}$$

With:

  • $n$: number of instructions in the instruction set
  • $\text{CPI}_i$: number of clock cycles instruction i takes to execute
  • $I_i$: count of instructions of type i in the program

$$\text{CPU time} = \text{Clock Cycle Time} \times \sum_{i=1}^{n} \text{CPI}_i \times I_i$$

Divide CPU time by Clock Cycle Time and Instruction Count to get the CPI:

$$\text{CPI} = \sum_{i=1}^{n} \text{CPI}_i \times F_i$$

  • $F_i$: frequency of instructions of type i, $F_i = I_i / \text{Instruction Count}$

Invest resources where time is spent.
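As a sketch of the weighted sum, the code below computes the average CPI from per-class instruction counts. The counts and cycle costs are illustrative numbers (a hypothetical 1000-instruction program), and the function and variable names are assumptions.

```python
def average_cpi(instruction_counts, cycles_per_instruction):
    """CPI = sum over classes of CPI_i x F_i, with F_i = I_i / instruction count."""
    total_instructions = sum(instruction_counts.values())
    return sum(
        cycles_per_instruction[op] * count / total_instructions
        for op, count in instruction_counts.items()
    )


def total_clock_cycles(instruction_counts, cycles_per_instruction):
    """Clock cycles = sum over classes of CPI_i x I_i."""
    return sum(
        cycles_per_instruction[op] * count
        for op, count in instruction_counts.items()
    )


# Illustrative instruction counts (I_i) and per-class cycle costs (CPI_i).
counts = {"ALU": 500, "Load": 200, "Store": 100, "Branch": 200}
cycles = {"ALU": 1, "Load": 5, "Store": 3, "Branch": 2}

cpi = average_cpi(counts, cycles)
clock_cycles = total_clock_cycles(counts, cycles)

print(f"Average CPI: {cpi:.2f}")
print(f"Total clock cycles: {clock_cycles}")
```

Dividing `clock_cycles` by the total instruction count gives the same CPI, which matches the "Clock Cycles / Instruction Count" form of the definition.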


Metrics and their Relation - Revisited -

$$\text{CPU Time} = \frac{\text{Seconds}}{\text{Program}}$$

$$\text{CPU Time} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$$

  • Instructions / Program: implementation and compiler optimization dependent
  • Cycles / Instruction: CPI, variable
  • Seconds / Cycle: clock cycle, fixed


Example

Example (RISC processor)

Typical mix, base machine (Reg / Reg):

    Op       Freq   CPI(i)   CPI(i) x Freq
    ALU      50%    1        0.5
    Load     20%    5        1.0
    Store    10%    3        0.3
    Branch   20%    2        0.4

  1. How much faster would the machine be if a better data cache reduced the average load time to 2 cycles?
  2. How does this compare with using branch prediction to shave a cycle off the branch time?
  3. What if two ALU instructions could be executed at once?


Answering 1.

  • Computing the CPI before improvement:

    Op       Freq   CPI(i)   CPI(i) x Freq
    ALU      50%    1        0.5
    Load     20%    5        1.0
    Store    10%    3        0.3
    Branch   20%    2        0.4
                             -----
                             2.2

    CPI1 = 0.5 x 1 + 0.2 x 5 + 0.1 x 3 + 0.2 x 2 = 2.2

  • Computing the CPI after improvement:

    Op       Freq   CPI(i)   CPI(i) x Freq
    ALU      50%    1        0.5
    Load     20%    2        0.4
    Store    10%    3        0.3
    Branch   20%    2        0.4
                             -----
                             1.6

    CPI2 = 0.5 x 1 + 0.2 x 2 + 0.1 x 3 + 0.2 x 2 = 1.6


Answering 1. (cont.)

How much faster would the machine be if a better data cache reduced the average load time to 2 cycles?

Answer: it is n times faster, with:

    n = CPU Time before improvement / CPU Time after improvement
      = (Clock Cycle Time x CPI1 x Inst. Count) / (Clock Cycle Time x CPI2 x Inst. Count)
      = CPI1 / CPI2 = 2.2 / 1.6 = 1.375

We say:

  • CPU is 1.375 times faster, or
  • CPU is 37.5% faster
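The answer can be reproduced in a few lines. This is a sketch using the instruction mix from the example; the dictionary layout and the names `BASE_MIX`, `cpi` and `percent_faster` are assumptions made for illustration.

```python
# Instruction mix from the example: op -> (frequency, cycles per instruction).
BASE_MIX = {
    "ALU": (0.5, 1),
    "Load": (0.2, 5),
    "Store": (0.1, 3),
    "Branch": (0.2, 2),
}


def cpi(mix):
    """Average CPI = sum of frequency x cycles over all instruction classes."""
    return sum(freq * cycles for freq, cycles in mix.values())


def percent_faster(n):
    """Convert 'n times faster' into the equivalent 'x% faster' wording."""
    return (n - 1.0) * 100.0


# Better data cache: the load now costs 2 cycles instead of 5.
better_cache_mix = dict(BASE_MIX, Load=(0.2, 2))

cpi1 = cpi(BASE_MIX)
cpi2 = cpi(better_cache_mix)
# Clock cycle time and instruction count are unchanged, so they cancel out.
n = cpi1 / cpi2

print(f"CPI1 = {cpi1:.1f}, CPI2 = {cpi2:.1f}")
print(f"n = {n:.3f} ({percent_faster(n):.1f}% faster)")
```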


Answering 2.

How does this compare with using branch prediction to shave a cycle off the branch time?

Answer: “shaving” a cycle off the branch time means the CPI of the branch is reduced by one cycle.

  • Computing the CPI after improvement:

    Op       Freq   CPI(i)   CPI(i) x Freq
    ALU      50%    1        0.5
    Load     20%    5        1.0
    Store    10%    3        0.3
    Branch   20%    1        0.2
                             -----
                             2.0

    CPI2 = 0.5 x 1 + 0.2 x 5 + 0.1 x 3 + 0.2 x 1 = 2.0

    n = CPI1 / CPI2 = 2.2 / 2.0 = 1.1

  • Reducing the load time (n = 1.375) produces better performance than reducing the branch time (n = 1.1)


Answering 3.

What if two ALU instructions could be executed at once?

Answer: call the base machine A and the enhanced machine B. Two ALU instructions executed at once means that one ALU instruction takes virtually half the time to execute on machine B. So, for ALU instructions:

    CPI(i)B = CPI(i)A / 2

  • Computing the CPI of machine B:

    Op       Freq   CPI(i)   CPI(i) x Freq
    ALU      50%    0.5      0.25
    Load     20%    5        1.0
    Store    10%    3        0.3
    Branch   20%    2        0.4
                             -----
                             1.95

    CPIB = 0.5 x 0.5 + 0.2 x 5 + 0.1 x 3 + 0.2 x 2 = 1.95

  • Machine B is therefore 2.2 / 1.95 ≈ 1.13 times faster than the base machine
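To put the three enhancements side by side, the sketch below recomputes the average CPI for each scenario and ranks the resulting speedups against the base machine. The scenario keys and helper names are assumptions for illustration; the mix and cycle counts are those of the example.

```python
# Base mix from the example: op -> (frequency, cycles per instruction).
BASE = {
    "ALU": (0.5, 1.0),
    "Load": (0.2, 5.0),
    "Store": (0.1, 3.0),
    "Branch": (0.2, 2.0),
}


def cpi(mix):
    """Average CPI = sum of frequency x cycles over all instruction classes."""
    return sum(freq * cycles for freq, cycles in mix.values())


def with_cycles(mix, op, cycles):
    """Return a copy of the mix where one class has a new cycle cost."""
    changed = dict(mix)
    freq, _ = changed[op]
    changed[op] = (freq, cycles)
    return changed


scenarios = {
    "cache": with_cycles(BASE, "Load", 2.0),      # question 1
    "branch": with_cycles(BASE, "Branch", 1.0),   # question 2
    "dual_alu": with_cycles(BASE, "ALU", 0.5),    # question 3
}

base_cpi = cpi(BASE)
speedups = {name: base_cpi / cpi(mix) for name, mix in scenarios.items()}

for name, n in sorted(speedups.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<9} CPI = {cpi(scenarios[name]):.2f}  n = {n:.3f}")
```

The better data cache wins because loads account for the largest share of the base CPI, even though ALU instructions are the most frequent.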


Time % Evaluation

How to determine which class of instructions takes the most time ?

  • Evaluate time percentages of instructions
  • Cannot be directly measured (program has mixed instructions)
  • Needs to be computed using CPI and frequency

Given:

  • $I_c$: Instruction Count
  • $I_i$: Instruction Count for instruction class i
  • $F_i$: Frequency of instructions of class i
  • $T_c$: Clock Cycle Time
  • $\text{CPI}_i$: Clock Cycles / Instruction for class i
  • $\text{CPI}$: Average Clock Cycles / Instruction for the whole program
  • $P_i$: Percentage of time for instructions of class i

$$\begin{aligned}
\text{CPUtime} &= \text{CPI} \times I_c \times T_c \\
\text{CPUtime}_i &= \text{CPI}_i \times I_i \times T_c \\
I_i &= I_c \times F_i \\
\text{CPUtime}_i &= \text{CPI}_i \times I_c \times F_i \times T_c \\
P_i &= \frac{\text{CPUtime}_i}{\text{CPUtime}} = \frac{\text{CPI}_i \times I_c \times F_i \times T_c}{\text{CPI} \times I_c \times T_c} \\
P_i &= \frac{\text{CPI}_i \times F_i}{\text{CPI}}
\end{aligned}$$
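Applied to the instruction mix of the earlier example, the final formula can be sketched as follows. The function name and data layout are assumptions; the frequencies and cycle counts are those of the base machine in the example.

```python
# Base mix from the example: op -> (frequency F_i, cycles CPI_i).
BASE = {
    "ALU": (0.5, 1),
    "Load": (0.2, 5),
    "Store": (0.1, 3),
    "Branch": (0.2, 2),
}


def time_percentages(mix):
    """P_i = CPI_i x F_i / CPI for every instruction class."""
    average_cpi = sum(freq * cycles for freq, cycles in mix.values())
    return {
        op: freq * cycles / average_cpi
        for op, (freq, cycles) in mix.items()
    }


shares = time_percentages(BASE)
slowest_class = max(shares, key=shares.get)

for op, share in shares.items():
    print(f"{op:<7} {share:6.1%}")
print(f"Class taking the most time: {slowest_class}")
```

Loads are only 20% of the instructions but take the largest share of the time, which is why instruction frequency alone is not enough to decide where to invest.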


Amdahl’s Law

Speed-up due to Enhancement E:

$$\text{Speedup} = \frac{\text{Execution Time w/o E}}{\text{Execution Time w/ E}} = \frac{\text{Performance w/ E}}{\text{Performance w/o E}}$$

Suppose that Enhancement E accelerates only a portion F of the execution time, by a factor S, with:

  • $T_A$: whole execution time without E
  • $T_E$: whole execution time with E
  • $T_{FA}$: time of the affected portion without E
  • $T_{FE}$: time of the affected portion with E

The new enhancement touches only a fraction F of the whole execution time $T_A$ and reduces this fraction by a factor S, while keeping the remainder of $T_A$ unchanged:

$$\begin{aligned}
T_E &= T_A - T_{FA} + T_{FE} && (T_A - T_{FA} \text{ is unchanged}) \\
T_{FA} &= T_A \times F && (F \text{ is a fraction of } T_A) \\
T_{FE} &= T_{FA} / S = T_A \times F / S && (\text{time is reduced by a factor } S) \\
T_E &= T_A - T_A \times F + T_A \times F / S \\
T_E &= T_A \times (1 - F + F/S)
\end{aligned}$$

This means:

$$\text{Speedup} = \frac{T_A}{T_E} = \frac{1}{1 - F + F/S}$$
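As a sanity check, the formula can be applied to the earlier data-cache question. In this sketch the fraction F is taken as the load share of execution time on the base machine (1.0 / 2.2) and S as the ratio of old to new load cycles (5 / 2); the function name and the input checks are assumptions added for illustration.

```python
def amdahl_speedup(fraction, factor):
    """Speedup = 1 / (1 - F + F / S) for a fraction F accelerated by a factor S."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    if factor <= 0.0:
        raise ValueError("factor must be positive")
    return 1.0 / ((1.0 - fraction) + fraction / factor)


# Loads contribute 1.0 of the 2.2 average CPI on the base machine.
load_fraction = 1.0 / 2.2
# A load drops from 5 cycles to 2 cycles.
load_factor = 5.0 / 2.0

speedup = amdahl_speedup(load_fraction, load_factor)
# Upper bound: even an infinitely fast load leaves the other 1 - F untouched.
limit = 1.0 / (1.0 - load_fraction)

print(f"Speedup with the better cache: {speedup:.3f}")
print(f"Best possible speedup from loads alone: {limit:.3f}")
```

The result matches the 1.375 obtained from the CPI ratio, and the limit shows that no load optimization on this mix can make the machine twice as fast.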


Benchmarks

  • Few users run the same program over and over
  • Need programs specially developed to compare performance
  • Best reference: real application
  • Real application is NOT common to all users

Benchmarks are programs developed for the sole purpose of performance evaluation.

Typical Workload

Full Application Benchmark

Small Benchmarks

SPEC95

  • Eighteen application benchmarks (with inputs) reflecting a technical computing workload
  • Eight integer
    • go, m88ksim, gcc, compress, li, ijpeg, perl, vortex
  • Ten floating-point intensive
    • tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fpppp, wave5
  • Must run with standard compiler flags
    • eliminate special undocumented incantations that may not even generate working code for real programs


Fallacies and Pitfalls

  • Amdahl’s law sets a limit: the achievable speedup is NOT unlimited
    • Improving one aspect cannot improve the overall performance by a factor proportional to the size of the improvement
  • Hardware-independent metrics DO NOT predict performance
    • Code size, implementation of software systems
  • Using MIPS (Millions of Instructions Per Second) as a performance metric
    • Instructions have different CPI
    • The MIPS metric varies from one program to another on the SAME CPU
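The MIPS pitfall can be illustrated with the standard relation MIPS = Clock Rate / (CPI x 10^6). In the sketch below the 500 MHz clock and the two per-program CPI values are hypothetical numbers chosen for illustration, not measurements from the slides.

```python
def mips(clock_rate_hz, cpi):
    """MIPS = clock rate / (CPI x 10^6)."""
    return clock_rate_hz / (cpi * 1e6)


# Hypothetical CPU and hypothetical average CPI of two programs run on it.
CLOCK_RATE_HZ = 500e6
program_cpi = {
    "alu_heavy": 1.2,   # mostly cheap ALU instructions
    "load_heavy": 3.0,  # many expensive loads
}

ratings = {name: mips(CLOCK_RATE_HZ, cpi) for name, cpi in program_cpi.items()}

for name, rating in ratings.items():
    print(f"{name:<10} {rating:6.1f} MIPS")
```

The same CPU gets two different MIPS ratings, so a single MIPS figure says little without knowing which program produced it.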

