
Master Program (Laurea Magistrale) in Computer Science and Networking

High Performance Computing Systems and Enabling Platforms

Marco Vanneschi

  • Instruction Level Parallelism

  • 2.1. Pipelined CPU

  • 2.2. More parallelism, data-flow model


Improving performance through parallelism

Sequential Computer

  • The sequential programming assumption

    • User's applications exploit the sequential paradigm: technology pull or technology push?

  • Sequential cooperation P – MMU – Cache – Main Memory – …

    • Request-response cooperation: low efficiency, limited performance

    • Caching: performance improvement through latency reduction (memory access time), even with sequential cooperation

  • Further, big improvement: increasing the CPU bandwidth through CPU-internal parallelism

    • Instruction Level Parallelism (ILP): several instructions executed in parallel

    • Parallelism is hidden to the programmer

      • The sequential programming assumption still holds

    • Parallelism is exploited at the firmware level (parallelization of the firmware interpreter of the assembler machine)

    • Compiler optimizations of the sequential program, exploiting the firmware parallelism at best

ILP: which parallelism paradigm?

All forms are feasible:

Pipelined implementation of the firmware interpreter

Replication of some parts of the firmware interpreter

Data-flow ordering of instructions

Vectorized instructions

Vectorized implementation of some parts of the firmware interpreter

Parallelism paradigms / parallelism forms

  • Stream parallelism

    • Pipeline

    • Farm

    • Data-flow

  • Data parallelism

    • Map

    • Stencil

  • Stream + data parallelism

    • Map / stencil operating on streams

Pipeline paradigm in general

Module M operating on a stream:

  • ∀ x : y = F (x) = F2 (F1 (F0 (x) ) )

  • Tseq = TF = TF0 + TF1 + TF2

[Figure: M is decomposed into a 3-stage pipeline M0, M1, M2 (parallelism degree n = 3), operating on the stream of x and producing the stream of y, with stage times T0 = TF0 + Tcom, T1 = TF1 + Tcom, T2 = TF2 + Tcom]

Service time: Tpipe = max (T0, T1, …, Tn-1); with parallelism degree n, the ideal service time is Tpipe-id = Tseq / n

Efficiency: e = Tpipe-id / Tpipe = Tseq / (n max (T0, T1, …, Tn-1))

If TF0 = TF1 = … = TFn-1 = t : Tpipe = t + Tcom (balanced pipeline)

If Tcom = 0 : Tpipe = Tpipe-id = Tseq / n (fully overlapped communications)

For balanced, fully overlapped pipeline structures:

processing bandwidth: Bpipe = Bpipe-id = n Bseq

efficiency: e = 1
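
For quick evaluations, the cost model above can be transcribed directly into code. A minimal sketch in Python (our illustration; the function names and the completion-time expression used below are not part of the course notation):

    # Pipeline cost model: service time, efficiency, completion time.
    def t_pipe(stage_times, t_com=0.0):
        # Tpipe = max over the stages of (TFi + Tcom)
        return max(tf + t_com for tf in stage_times)

    def efficiency(stage_times, t_com=0.0):
        # e = Tpipe-id / Tpipe, with Tpipe-id = Tseq / n
        n = len(stage_times)
        return (sum(stage_times) / n) / t_pipe(stage_times, t_com)

    def completion_time(stage_times, m, t_com=0.0):
        # Tc ~= (n - 1 + m) * Tpipe; ~= m * Tpipe for m >> n
        return (len(stage_times) - 1 + m) * t_pipe(stage_times, t_com)

    print(efficiency([2.0, 2.0, 2.0]))      # balanced, overlapped: e = 1.0
    print(efficiency([1.0, 4.0, 1.0]))      # unbalanced: e = 0.5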

Pipeline latency

Module M operating on a stream:

  • ∀ x : y = F (x) = F2 (F1 (F0 (x) ) )

  • Latency: Lseq = Tseq = TF = TF0 + TF1 + TF2

[Figure: the 3-stage pipeline M0, M1, M2, with stage latencies L0 = TF0 + Lcom, L1 = TF1 + Lcom, L2 = TF2 + Lcom]

Latency: Lpipe = L0 + L1 + L2 = Tseq + 3 Lcom ≥ Lseq

Bandwidth improvement is obtained at the expense of increased latency wrt the sequential implementation.

Pipeline completion time

[Figure: pipeline execution timeline showing the filling transient period, the steady state period, and the emptying transient period]

m = stream length (= 2 in the figure)

n = parallelism degree (= 2 in the figure)

T (n) = service time with parallelism degree n

Completion time: Tc (n) ≈ (n − 1) T (n) + m T (n); if m >> n : Tc (n) ≈ m T (n)

This is the fundamental relation between completion time and service time, valid for "long streams" (e.g. the instruction sequence of a program …).

Pipeline with internal state

  • Computation with internal state: for each stream element, the output value depends also on an internal state variable, and the internal state is updated.

  • Partitioned internal state:

    if disjoint partitions of the state variable can be recognized

    and they are encapsulated into distinct stages

    and each stage operates on its own state partition only,

    then the presence of internal state has no impact on service time.

  • However, if the state partitions are related to each other, i.e. if there is a STATE CONSISTENCY PROBLEM, then some degradation of the service time exists.

[Figure: adjacent stages Mi-1, Mi, Mi+1, each encapsulating its own state partition Si-1, Si, Si+1]

Pipelining and loop unfolding

k-unfolding transforms the module

M :: int A[m]; int s, s0;
{ < receive s0 from input_stream >;
  s = s0;
  for (i = 0; i < m; i++)
    s = F (s, A[i]);
  < send s onto output_stream >
}

into a version whose loop is split into k consecutive sub-loops (k times):

M :: int A[m]; int s, s0;
{ < receive s0 from input_stream >;
  s = s0;
  for (i = 0; i < m/k; i++)
    s = F (s, A[i]);
  for (i = m/k; i < 2m/k; i++)
    s = F (s, A[i]);
  …
  for (i = (k-1)m/k; i < m; i++)
    s = F (s, A[i]);
  < send s onto output_stream >
}

i.e. a k-stage balanced pipeline with A partitioned (k partitions).

  • Formally, this transformation implements a form of stream data-parallelism with stencil:

  • simple stencil (linear chain),

  • optimal service time,

  • but increased latency compared to non-pipelined data-parallel structures.
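
A runnable sketch of the transformation (Python for illustration; F, A and k are placeholders) shows that the k sub-loops compute exactly the same result as the original loop:

    # k-unfolding: one fold loop vs. k consecutive sub-loops (pipeline stages).
    def folded(F, s0, A):
        s = s0
        for x in A:
            s = F(s, x)
        return s

    def unfolded(F, s0, A, k):
        s = s0
        m = len(A)
        for j in range(k):                       # j-th pipeline stage
            for i in range(j * m // k, (j + 1) * m // k):
                s = F(s, A[i])
        return s

    A = list(range(16))
    F = lambda s, x: s + x
    assert folded(F, 0, A) == unfolded(F, 0, A, k=4)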

From sequential CPU …

[Figure: sequential CPU. The CPU chip contains the processor P, the MMU and the cache C; the Memory Interface (MINF) connects it to the main memory M; an interrupt arbiter and the I/O units complete the scheme]

… to pipelined CPU

Parallelization of the firmware interpreter:

while (true) do
  instruction fetch (IC);
  instruction decoding;
  (possible) operand fetch;
  execution;
  (possible) result writing;
  IC update;
  < interrupt handling: firmware phase >

Generate stream

  • instruction stream

Recognize pipeline stages

  • corresponding to interpreter phases

Load balance of stages

  • Pipeline with internal state

    • problems of state consistency

    • "impure" pipeline

    • impact on performance



Basic principle of Pipelined CPU: simplified view

[Figure: the memory M becomes Instruction Memory (IM) + Data Memory (DM); the processor P becomes Instruction Unit (IU) + Execution Unit (EU). A timing diagram shows three consecutive instructions flowing through the stages: (1) instruction stream generation at consecutive addresses (IM), (2) instruction decode and data address generation (IU), (3) data accesses (DM), then the execution phase (EU)]

    In principle, a continuous stream of instructions feeds the pipelined execution of the firmware interpreter;

    each pipeline stage corresponds to an interpreter phase.

Pipelined CPU: an example

[Figure: block diagram of the pipelined CPU chip, with MINF towards the memory M, the I/O bus and the interrupt arbiter; primary cache only is shown for simplicity]

    • IM: Instruction Memory
    • MMUI: Instruction Memory MMU
    • CI: Instruction Cache
    • TAB-CI: Instruction Cache Relocation
    • DM: Data Memory
    • MMUD: Data Memory MMU
    • CD: Data Cache
    • TAB-CD: Data Cache Relocation
    • IU: Instruction preparation Unit; possibly parallel
    • EU: instruction Execution Unit; possibly parallel/pipelined
    • RG: General Registers, primary copy
    • RG1: General Registers, secondary copy
    • IC: program counter, primary copy
    • IC1: program counter, secondary copy

Main tasks of the CPU units (simplified)

    • IM : generates the instruction stream at consecutive addresses (IC1); if a new IC value is received from IU, it restarts the stream generation from this value

      • IM service time = 1τ + Tcom = 2τ (Ttr = 0)

    • IU : receives an instruction from IM; if valid then, according to the instruction class:

      • if arithmetic instruction: sends the instruction to EU

      • if LOAD: sends the instruction to EU, computes the operand address on updated registers, sends a reading request to DM

      • if STORE: sends the instruction to EU, computes the operand address on updated registers, sends a writing request to DM (address and data)

      • if IF: verifies the branch condition on general registers; if true, updates IC with the target address and sends the IC value to IM, otherwise increments IC

      • if GOTO: updates IC with the target address and sends the IC value to IM

      • IU service time = 1τ + Tcom = 2τ (Ttr = 0)

Main tasks of the CPU units (simplified)

    • DM : receives a request from IU and executes it; if a reading operation, it then sends the read value to EU

      • DM service time = 1τ + Tcom = 2τ (Ttr = 0)

  • EU : receives a request from IU and executes it:

    • if arithmetic instruction: executes the corresponding operation, stores the result into the destination register, and sends it to IU

    • if LOAD: waits for the operand from DM, stores it into the destination register, and sends it to IU.

    • EU service time = 1τ + Tcom = 2τ (Ttr = 0); the latency may be 1τ or greater



AMD Opteron X4 (Barcelona)

[Slide figure not reproduced in the transcript]

Pipeline abstract architecture

    • The CPU architecture "as seen" by the compiler

      • It captures the essential features for defining an effective cost model of the CPU

    • First assumption: pure RISC (e.g. D-RISC)

    • all stages have the same service time (well balanced): t = Tid

    • Ttr = 0; if t ≥ 2τ, then Tcom = 0

    • for very fine grain operations, all communications are fully overlapped to internal calculations

[Figure: abstract architecture, i.e. IM (with IC1) feeding IU (with IC and RG1), which interacts with DM and EU (with RG)]

The compiler can easily simulate the program execution on this abstract architecture, or an analytical cost model can be derived.

Example 1

1. LOAD R1, R0, R4
2. ADD R4, R5, R6
3. …

Execution simulation: to determine the service time T and the efficiency e.

[Timing diagram: the instructions flow through IM, IU, DM, EU with asynchronous communications; a new instruction enters the pipeline every 2τ]

Instruction sequences of this kind are able to fully exploit the pipelined CPU. In fact, in this case the CPU behaves as a "pure" pipeline: no performance degradation. "Skipping" the DM stage has no impact on the service time, in this example.

Peak performance

    • For t ≥ 2tTcom = 0

    • Assume: t = 2t forall processing units

      • including EU: arithmeticoperationshaveservice time 2t

      • floatingpointarithmetic: pipeline implementation (e.g., 4 or 8 stages …)

      • forintegerarithmetic: latency 2t (e.g., hardware multiply/divide)

      • however, thislatencydoesnotholdforfloatingpointarithmetic, e.g. 4 or 8 timesgreater

    • CPU Service time: T = Tid = t

    • Performance:

      Pid= 1/Tid = 1/2t = fclock/2 instructions/sec

      • e.g. fclock = 4 GH2, Pid= 2 GIPS

    • Meaning: aninstructionisfetchedevery 2t

      • i.e., a newinstructionisprocessedevery 2t

      • from the evaluationviewpoint, the burdenbetweenassembler and firmwareisvanishing!

      • marginaladvantageoffirmwareimplementation: microinstructionparallelism

    • Furtherimprovements (e.g., superscalar) willleadto

      Pid= w/t = w fclock , w ≥ 1

Performance degradations

    • For some programs, the CPU is not able to operate as a "pure" pipeline

    • Consistency problems on replicated copies:

      • RG in EU, RG1 in IU → associated synchronization mechanism

      • IC in IU, IC1 in IM → associated synchronization mechanism

    • In order to guarantee consistency through synchronization, some "bubbles" are introduced in the pipeline flow

      • increased service time

    • Consistency of RG, RG1: LOGICAL DEPENDENCIES ON GENERAL REGISTERS ("data hazards")

    • Consistency of IC, IC1: BRANCH DEGRADATIONS ("control hazards or branch hazards")

Example 2: branch degradation

1. LOAD R1, R0, R4
2. ADD R4, R5, R6
3. GOTO L
4. …

[Timing diagram: while IU processes the GOTO, IM has already fetched instruction 4]

"bubble": IM performs a wasted fetch (instruction 4), instruction 4 must be discarded by IU, about one time interval t is wasted and it propagates through the pipeline.

Abstract architecture: exactly one time interval t is wasted (approximate evaluation of the concrete architecture).

Example 3: logical dependency

1. LOAD R1, R0, R4
2. ADD R4, R5, R6
3. STORE R2, R0, R6

[Timing diagram: IU stalls on instruction 3 until EU has completed instruction 2]

Logical dependency induced by instruction 2 on instruction 3; distance k = 1.

Instruction 3: when processed by IU, it needs the updated value of the RG[6] contents. Thus, IU must wait for the completion of the execution of instruction 2 in EU. In this case, two time intervals t are wasted (approximate view of the abstract architecture). The impact of the execution latency is significant (in this case, DM is involved because of the LOAD instruction). If an arithmetic instruction were in place of the LOAD, the bubble would shrink to one time interval t.

Pipelined D-RISC CPU: implementation issues

Assume a D-RISC machine, and verify the feasibility of the abstract architecture characteristic: all stages have the same service time = 2τ.

[Figure: the pipelined CPU chip of the previous example]

RG-RG1 consistency: integer semaphore mechanism associated to RG1.

IC-IC1 consistency: unique identifier associated to the instructions coming from IM, in order to accept or to discard them in IU.

RG1 consistency

    • Each general register RG1[i] has an associated non-negative integer semaphore S[i]

    • For each arithmetic/LOAD instruction having RG[i] as destination, IU increments S[i]

    • Each time RG1[i] is updated with a value sent from EU, IU decrements S[i]

    • For each instruction that needs RG[i] to be read by IU (address components for LOAD, address components and source value for STORE, condition applied to registers for IF, address in register for GOTO), IU waits until the condition (S[i] = 0) holds
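
A minimal sketch of this semaphore mechanism (Python; the class and method names are our illustration, not the course's microcode):

    # Counting-semaphore consistency for the IU-side register copy RG1.
    class RG1:
        def __init__(self, n=64):
            self.val = [0] * n    # RG1[i]: IU-side register values
            self.s = [0] * n      # S[i]: pending EU writes targeting RG[i]

        def issue_writer(self, i):
            # IU sends to EU an instruction having RG[i] as destination.
            self.s[i] += 1

        def writeback(self, i, value):
            # EU sends back the updated value of RG[i].
            self.val[i] = value
            self.s[i] -= 1

        def readable(self, regs):
            # IU may read RG1[i] only when no write is pending: S[i] = 0.
            return all(self.s[i] == 0 for i in regs)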

IC1 consistency

    • Each time IU executes a branch or jump, IC is updated with the new target address, and this value is sent to IM

    • As soon as it is received from IU, IM writes the new value into IC1

    • Every instruction sent to IU is accompanied by the IC1 value

    • For each received instruction, IU compares the value of IC with the received value of IC1: the instruction is valid if they are equal, otherwise the instruction is discarded.

    • Notice that IC (IC1) acts as a unique identifier. Alternatively, a shorter identifier can be generated by IU and sent to IM.

Detailed tasks of the CPU units

    • IM

      • MMUI: translates IC1 into the physical address IA of the instruction, sends (IA, IC1) to TAB-CI, and increments IC1

        • when a new message is received from IU, MMUI updates IC1

      • TAB-CI : translates IA into the cache address CIA of the instruction, or generates a cache fault condition (MISS), and sends (CIA, IC1, MISS) to CI

      • CI : if MISS is false, reads the instruction INSTR at address CIA and sends (INSTR, IC1) to IU; otherwise performs the block transfer, then reads the instruction INSTR at address CIA and sends (INSTR, IC1) to IU

Detailed tasks of the CPU units

    • IU

      • when receiving (regaddr, val) from EU: writes val into RG[regaddr] and decrements S[regaddr]

        or

      • when receiving (INSTR, IC1): if IC = IC1 the instruction is valid and IU goes on, otherwise it discards the instruction

      • if type = arithmetic (COP, Ra, Rb, Rc): sends the instruction to EU, increments S[c]

      • if type = LOAD (COP, Ra, Rb, Rc): waits for the updated values of RG[a] and RG[b], sends the request (read, RG[a] + RG[b]) to DM, increments S[c]

      • if type = STORE (COP, Ra, Rb, Rc): waits for the updated values of RG[a], RG[b] and RG[c], sends the request (write, RG[a] + RG[b], RG[c]) to DM

      • if type = IF (COP, Ra, Rb, OFFSET): waits for the updated values of RG[a] and RG[b]; if the branch condition is true then IC = IC + OFFSET and sends the new IC to IM, otherwise IC = IC + 1

      • if type = GOTO (COP, OFFSET): IC = IC + OFFSET and sends the new IC to IM

      • if type = GOTO (COP, Ra): waits for the updated value of RG[a], IC = RG[a] and sends the new IC to IM

Detailed tasks of the CPU units

    • DM

      • MMUD: receives (op, log_addr, data) from IU, translates log_addr into phys_addr, sends (op, phys_addr, data) to TAB-CD

      • TAB-CD : translates phys_addr into the cache address CDA, or generates a cache fault condition (MISS), and sends (op, CDA, data, MISS) to CD

      • CD: if MISS is false, then according to op:

        • reads VAL at address CDA and sends VAL to EU;

        • otherwise writes data at address CDA;

          if MISS is true: performs the block transfer, then completes the requested operation as before

    • EU

      • receives the instruction (COP, Ra, Rb, Rc) from IU

      • if type = arithmetic (COP, Ra, Rb, Rc): executes RG[c] = COP(RG[a], RG[b]), sends the new RG[c] to IU

      • if type = LOAD (COP, Ra, Rb, Rc): waits for VAL from DM, RG[c] = VAL, sends VAL to IU
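
The IU dispatch rules above can be condensed into a small sketch (Python; the tuple encoding of instructions and the use of plain lists as message queues are our simplifications):

    # Sketch of the IU dispatch logic.
    def iu_dispatch(instr, ic, ic1, rg1, s, to_eu, to_dm, to_im):
        """Returns the new IC, or None if IU must wait on a semaphore."""
        if ic != ic1:
            return ic                                  # invalid: discard
        typ, ra, rb, rc = instr                        # rc: dest or target
        if typ == "ARITH":
            to_eu.append(instr); s[rc] += 1
        elif typ == "LOAD":
            if s[ra] or s[rb]:
                return None                            # wait: S[a] = S[b] = 0
            to_dm.append(("read", rg1[ra] + rg1[rb]))
            to_eu.append(instr); s[rc] += 1
        elif typ == "STORE":
            if s[ra] or s[rb] or s[rc]:
                return None
            to_dm.append(("write", rg1[ra] + rg1[rb], rg1[rc]))
        elif typ == "GOTO":
            to_im.append(rc)                           # send new IC to IM
            return rc
        return ic + 1

    rg1, s = [0] * 8, [0] * 8
    q_eu, q_dm, q_im = [], [], []
    print(iu_dispatch(("ARITH", 1, 2, 3), 0, 0, rg1, s, q_eu, q_dm, q_im))  # 1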

Cost model

    • To be evaluated on the abstract architecture: good approximation

    • Branch degradation:

      l = branch probability

      T = l 2t + (1 – l) t = (1 + l) t

    • Logical dependencies degradation:

      T = (1 + l) t + D

      D = average delay time in IU processing

    • Efficiency: e = Tid / T = t / T
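
In code, the model reads as follows (a sketch; the efficiency formula e = t/T is the one used in the examples below):

    # Service time and efficiency of the pipelined CPU.
    def service_time(t, l, D):
        # T = (1 + l) t + D: branch degradation plus dependency delay
        return (1.0 + l) * t + D

    def efficiency(t, l, D):
        # e = Tid / T = t / T
        return t / service_time(t, l, D)

    print(efficiency(1.0, 1.0 / 3.0, 0.0))   # l = 1/3, D = 0: e = 0.75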

Example

    • 1. LOAD R1, R0, R4

    • 2. ADD R4, R5, R6

    • 3. GOTO L

    • D = 0

    • l = 1/3

    T = (1 + l) t = 4t/3 ; e = t/T = 3/4

Logical dependencies degradation

    • Formally, D is the response time RQ in a client-server system

    • Server: DM-EU subsystem

    • A logical queue of instructions is formed at the input of DM-EU server

    • Clients: logical instances of the IM-IU subsystem,

      corresponding to queued instructions sent to DM-EU

    • Let d = average probability of logical dependencies

      D = d RQ

    • How to evaluate RQ?

Client – server systems in general (at any level)

[Figure: n clients C1 … Cn, a request queue Q, and the server S]

    • Assume a request-response behaviour:

      • a client sends a request to the server,

      • then waits for the response from the server.

    • The service time of each client is delayed by the RESPONSE TIME of the server.

    • Modeling and evaluation as a Queueing System.

    Ci ::
    initialize x …
    while (true) do
    { < send x to S >;
      < receive y from S >;
      x = G (y, …)
    }

    S ::
    while (true) do
    { < receive x from any C ∈ {C1, …, Cn} >;
      y = F (x, …) ;
      < send y to C >
    }

Client-server queueing modeling

    Tcl: average service time of generic client

    TG: ideal average service time of generic client

    RQ: average response time of server

    r: server queue utilization factor

    TS: average service time of server

    LS: average latency time of server

    sS: other service time distribution parameters (e.g., variance of the service time of the server)

    TA: average interarrival time to server

    n: number of clients (assumed identical)

    System of equations to be solved through analytical or numerical methods.

    One and only one real solution exists satisfying the constraint: r < 1.

    This methodology will be applied several times during the course.
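
The transcript does not reproduce the system of equations itself; the sketch below only illustrates the kind of fixed-point computation involved, under an assumed M/M/1-like waiting-time model WQ = r TS / (1 − r) (the model choice is ours, purely illustrative):

    # Fixed-point solution of the client-server queueing model (sketch).
    def solve(n, TG, TS, LS, iters=1000):
        RQ = 0.0
        for _ in range(iters):
            Tcl = TG + RQ               # client service time delayed by RQ
            TA = Tcl / n                # interarrival time with n clients
            r = TS / TA                 # server utilization factor, r < 1
            WQ = r * TS / (1.0 - r)     # assumed M/M/1-like waiting time
            RQ = WQ + LS                # response time = waiting + latency
        return r, RQ

    print(solve(n=8, TG=100.0, TS=10.0, LS=12.0))   # r ~= 0.62, RQ ~= 28.5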

Typical results: server utilization degree (= r)

[Chart not reproduced in the transcript]

Typical results: client efficiency (= TG/Tcl)

[Chart not reproduced in the transcript]

Parallel server: the impact of service time and of latency

    • A more significant way to express the response time, especially for a parallel server:

      RQ = WQ (r, …) + LS

    • WQ: average waiting time in queue (queue of client requests): WQ depends on the utilization factor r, thus on the service time only (not on latency)

    • LS is the latency of the server

    • Separation of the impact of service time and of latency

    • All the parallelism paradigms aim to reduce the service time

    • Some parallelism paradigms are able to reduce the latency too

      • Data-parallel, data-flow, …

        while other paradigms increase the latency

      • Pipeline, farm, …

    • Overall effect: in which cases does the service time (latency) impact dominate the latency (service time) impact?

    • For each server-parallelization problem, the relative impact of service time and latency must be carefully studied in order to find a satisfactory trade-off.

Pipelined CPU: approximate evaluation of D

    • For a D-RISC machine

    • Evaluation on the abstract architecture

    • Let:

      • k = distance of a logical dependency

      • dk = probability of a logical dependency with distance k

      • NQ = average number of instructions waiting in the DM-EU subsystem

    • It can be shown that:

      D = Σk dk (NQk + 1 − k) t

      where the summation is extended to non-negative terms only.

      NQ = 2 if the instruction inducing the logical dependency is the last instruction of a sequence containing all EU-executable instructions and at least one LOAD,

      otherwise NQ = 1.
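
A direct transcription of this formula (a sketch; deps is a hypothetical list of (k, dk, NQk) triples):

    # D = sum over k of dk * (NQk + 1 - k) * t, non-negative terms only.
    def D(deps, t):
        return sum(dk * (nq + 1 - k) * t
                   for (k, dk, nq) in deps
                   if nq + 1 - k > 0)

    # One dependency with k = 1, d1 = 1/3, NQ = 2 (as in Example 3): D = 2t/3
    print(D([(1, 1.0 / 3.0, 2)], t=1.0))    # 0.666...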

D formula for D-RISC: proof

    • D = d1 D1 + d2 D2 + …, extended to non-negative terms only

    • dk = probability of a logical dependency with distance k

    • Dk = RQk = WQk + LS

    • WQk = average number of queued instructions that (with probability dk) precede the instruction inducing the logical dependency

    • LS = latency of EU in D-RISC; with hardware-implemented arithmetic operations on integers, LS = t

    • Dk = WQk + t

    • Problem: find a general expression for WQk

    • The number of distinct situations that can occur in a D-RISC abstract architecture is very limited

    • Thus, the proof can be done by enumeration.

    • Let us individuate the distinct situations.

D formula for D-RISC: proof

    • Case A (distance k = 1, probability d1):

      ADD …, …, R1
      STORE …, …, R1

      WQ = 0: when 2 is in IU, 1 is in EU (current service).
      D = d1 (WQ + LS) = d1 t

    • Case B (distance k = 2, probability d2):

      ADD …, …, R1
      SUB …, …, R2
      STORE …, …, R1

      WQ = −1t: when 3 is in IU, 1 has already been executed.
      D = d2 (WQ + LS) = 0

    • Case C (distance k = 1, probability d1):

      LOAD …, …, …
      ADD …, …, R1
      STORE …, …, R1

      WQ = 1t: when 3 is in IU, 1 is in EU (current service), 2 is waiting in queue.
      D = d1 (WQ + LS) = 2 d1 t

    • Case D (distance k = 2, probability d2):

      LOAD …, …, …
      ADD …, …, R1
      SUB …, …, R2
      STORE …, …, R1

      WQ = 0: when 4 is in IU, 2 is in EU (current service).
      D = d2 (WQ + LS) = d2 t

    In all other cases D = 0, in particular for k > 2.

D formula for D-RISC: proof

    • WQk = number of queued instructions preceding the instruction that induces the logical dependency

    • Maximum value of WQk = 1t:

      • the instruction currently executed in EU has been delayed by a previous LOAD, and the instruction inducing the logical dependency is still in queue

    • Otherwise: WQk = 0 or WQk = −1t

      • WQk = 0: the instruction currently executed in EU induces the logical dependency,

      • WQk = 0: the instruction currently executed in EU has been delayed by a previous LOAD, and it is the instruction inducing the logical dependency

      • WQk = −1t: the instruction inducing the logical dependency has already been executed

    • To be proven: WQk = (NQk − k) t

      • NQk = 2 if the instruction inducing the logical dependency is the last instruction of a sequence containing all EU-executable instructions and at least one LOAD,

      • otherwise NQk = 1.

D formula for D-RISC: proof

    • Case A (distance k = 1, probability d1):

      ADD …, …, R1
      STORE …, …, R1

      NQ = 1, WQ = (NQ − k) t = 0: when 2 is in IU, 1 is in EU (current service).
      D = d1 (WQ + LS) = d1 t

    • Case B (distance k = 2, probability d2):

      ADD …, …, R1
      SUB …, …, R2
      STORE …, …, R1

      NQ = 1, WQ = (NQ − k) t = −1t: when 3 is in IU, 1 has already been executed.
      D = d2 (WQ + LS) = 0

    • Case C (distance k = 1, probability d1):

      LOAD …, …, …
      ADD …, …, R1
      STORE …, …, R1

      NQ = 2, WQ = (NQ − k) t = 1t: when 3 is in IU, 1 is in EU (current service), 2 is waiting in queue.
      D = d1 (WQ + LS) = 2 d1 t

    • Case D (distance k = 2, probability d2):

      LOAD …, …, …
      ADD …, …, R1
      SUB …, …, R2
      STORE …, …, R1

      NQ = 2, WQ = (NQ − k) t = 0: when 4 is in IU, 2 is in EU (current service).
      D = d2 (WQ + LS) = d2 t

    In all other cases D = 0: k > 2, or k = 2 and NQ = 1.

Example

1. LOAD R1, R0, R4
2. ADD R4, R5, R6
3. STORE R2, R0, R6

l = 0

One logical dependency induced by instruction 2 on instruction 3: distance k = 1, probability d1 = 1/3, NQ = 2.

Hence D = d1 (NQ + 1 − k) t = 2t/3, T = (1 + l) t + D = 5t/3, e = t/T = 3/5.

Example

    1. L : LOAD R1, R0, R4

    2. LOAD R2, R0, R5

    3. ADD R4, R5, R4

    4. STORE R3, R0, R4

    5. INCR R0

    6. IF < R0, R6, L

    7. END

Two logical dependencies with the same distance (= 1): the one induced by instruction 3 on instruction 4 with NQ = 2, the other induced by instruction 5 on instruction 6 with NQ = 1. Hence l = 1/6, D = (2/6 + 1/6) t = t/2, T = (1 + l) t + D = 5t/3, e = 3/5.

Note: intermixed/nested logical dependencies

The logical dependency induced by instruction 5 on instruction 7 has no effect; likewise, the logical dependency induced by instruction 4 on instruction 7 has no effect.

[The two numbered instruction sequences of the slide are not reproduced in the transcript]

Cache faults

    • All the performance evaluations must be corrected taking into account also the effects of cache faults (Instruction Cache, Data Cache) on completion time.

    • The added penalty for block transfer (data)

      • Worst-case: exactly the same penalty studied for sequential CPUs (See Cache Prerequisites),

      • Best-case: no penalty - fully overlapped to instruction pipelining

Compiler optimizations

    • Minimize the impact of branch degradations:

      • DELAYED BRANCH

        = try to fill the branch bubbles with useful instructions

      • Annotated instructions

    • Minimize the impact of logical dependencies degradations:

      • CODE MOTION

        = try to increase the distance of logical dependencies

    • Preserve program semantics, i.e. transformations that satisfy the Bernstein conditions for correct parallelization.

Bernstein conditions

Given the sequential computation:

R1 = f1(D1) ; R2 = f2(D2)

it can be transformed into the equivalent parallel computation:

R1 = f1(D1) ∥ R2 = f2(D2)

if all the following conditions hold:

R1 ∩ D2 = ∅ (no true dependency)

R1 ∩ R2 = ∅ (no output dependency)

D1 ∩ R2 = ∅ (no antidependency)

Notice: for the parallelism inside a microinstruction (see Firmware Prerequisites), only the first and second conditions are sufficient (synchronous model of computation). E.g.

A + B → C ; D + E → A is equivalent to: A + B → C , D + E → A
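
A small sketch of the check (Python for illustration; R and D are the write and read sets of each statement):

    # Bernstein conditions: can two statements run in parallel?
    def bernstein(r1, d1, r2, d2, synchronous=False):
        true_dep = set(r1) & set(d2)      # R1 intersect D2
        out_dep = set(r1) & set(r2)       # R1 intersect R2
        anti_dep = set(d1) & set(r2)      # D1 intersect R2
        if synchronous:                   # microinstruction case
            return not (true_dep or out_dep)
        return not (true_dep or out_dep or anti_dep)

    # A + B -> C ; D + E -> A : parallel only in the synchronous model
    print(bernstein(["C"], ["A", "B"], ["A"], ["D", "E"]))                   # False
    print(bernstein(["C"], ["A", "B"], ["A"], ["D", "E"], synchronous=True)) # True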

Delayed branch: example

Original loop:

1. L : LOAD R1, R0, R4
2. LOAD R2, R0, R5
3. ADD R4, R5, R4
4. STORE R3, R0, R4
5. INCR R0
6. IF < R0, R6, L
7. END

With the delayed branch, the bubble is filled with a useful instruction, so l = 0:

1. LOAD R1, R0, R4
2. L : LOAD R2, R0, R5
3. ADD R4, R5, R4
4. STORE R3, R0, R4
5. INCR R0
6. IF < R0, R6, L, delayed_branch
7. LOAD R1, R0, R4
8. END

(previous version: e = 3/5)

Increase logical dependency distance: example

Previous version (delayed branch only), with d1 = 1/6:

1. LOAD R1, R0, R4
2. L : LOAD R2, R0, R5
3. ADD R4, R5, R4
4. STORE R3, R0, R4
5. INCR R0
6. IF < R0, R6, L, delayed_branch
7. LOAD R1, R0, R4
8. END

Code motion increases the distance of the logical dependency induced by the ADD on the STORE:

1. LOAD R1, R0, R4
2. L : LOAD R2, R0, R5
3. ADD R4, R5, R4
4. INCR R0
5. STORE R3, R0, R4
6. IF < R0, R6, L, delayed_branch
7. LOAD R1, R0, R4
8. END

For this transformation, the compiler initializes register R3 at the value Base_C − 1, where Base_C is the base address of the result array.

(previous version: e = 2/3)

Tc ≈ 6 N T = 8 N t

Exploit both optimizations jointly

Previous version:

1. LOAD R1, R0, R4
2. L : LOAD R2, R0, R5
3. ADD R4, R5, R4
4. STORE R3, R0, R4
5. INCR R0
6. IF < R0, R6, L, delayed_branch
7. END

Exploiting delayed branch and code motion jointly:

1. L : LOAD R1, R0, R4
2. LOAD R2, R0, R5
3. INCR R0
4. ADD R4, R5, R4
5. IF < R0, R6, L, delayed_branch
6. STORE R3, R0, R4
7. END

(previous version: e = 3/4)

Tc ≈ 6 N T = 7 N t

Exercises

    • Compile, with optimizations, and evaluate the completion time of a program that computes the matrix-vector product (operands: int A[M][M], int B[M]; result: int C[M], where M = 10^4) on a D-RISC Pipeline machine.

      The evaluation must include the cache fault effects, assuming a Data Cache of 64K words capacity, 8-word blocks, operating on-demand.

    • Consider the curves of the "Typical results" slides (server utilization degree and client efficiency). Explain their shapes in a qualitative way (e.g. why e tends asymptotically to an upper bound value as TG tends to infinity, …).

Exercise

    • Systems S1 and S2 have a Pipelined CPU architecture. S1 is a D-RISC machine. The assembler level of S2 is D-RISC enriched with the following instruction:

      REDUCE_SUM RA, RN, Rx

      with RG[RA] = base address of an integer array A[N], and RG[RN] = N.

      The semantics of this instruction is

      RG[Rx] = reduce (A[N], +)

      Remember that

      reduce (H[N], ⊕)

      where ⊕ is any associative operator, is a second-order function whose result is equal to the scalar value

      H[0] ⊕ H[1] ⊕ … ⊕ H[N-1]

      • Explain how a reduce (A[N], +) is compiled on S1 and how it is implemented on S2, and evaluate the difference d in the completion times of reduce on S1 and on S2.

      • A certain program includes reduce (A[N], +) executed one and only one time. Explain why the difference of the completion times of this program on S1 and on S2 could be different compared to d as evaluated in point a).

Execution Unit and "long" instructions

    • The EU implementation of "long" arithmetic operations, e.g.

      • integer multiplication, division

      • floating point addition

      • floating point multiplication, division

      • other floating point operations (square root, sin, cosine, …)

        has to respect some basic guidelines:

      • (A) to maintain the EU service time equal to the CPU ideal service time Tid,

      • (B) to minimize the EU latency, in order to minimize the logical dependencies degradation.

    • To deal with issues A and B we can apply:

      • parallel paradigms at the firmware level

      • data-parallel paradigms and special hardware implementations of arithmetic operations.

Integer multiplication, division

    • A simple firmware implementation is a loop algorithm (similar to the familiar operation "by hand"), exploiting additions/subtractions and shifts

      • Latency = O(word length) τ ≈ 50 τ (32-bit integers)

    • Hardware implementations in 1-2 τ exist, though expensive in terms of silicon area if applied to the entire word

    • Intermediate solution: firmware loop algorithm, in which a few iterations are applied to parts of the word (e.g. bytes), each iteration implemented in hardware.

    • Intermediate solutions exploiting parallel paradigms:

      • farm of firmware multiply-&-divide units: reduced service time, but not latency

      • loop-unfolding pipeline, with stages corresponding to parts (bytes) implemented in hardware: reduced service time, latency as in the sequential version (intermediate).

Multiplication through loop-unfolding pipeline

Multiply a, b; result c.

Let byte (v, j) denote the j-th byte of the integer word v.

Sequential multiplication algorithm (32 bit) exploiting the hardware implementation applied to bytes:

init c;
for (i = 0; i < 4; i++)
  c = combine ( c, hardware_multiplication ( byte (a, i), byte (b, i) ) )

Loop-unfolding pipeline ("systolic") implementation:

[Figure: four "hw_mul & combine" stages in a pipeline; stage i consumes byte (a, i), byte (b, i) and the partial result c of the previous stage; the last stage emits c]

service time = 2τ, latency = 8τ
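
The slide leaves hardware_multiplication and combine abstract. One concrete instantiation (our choice, for illustration) has stage i multiply byte (a, i) by the whole of b and accumulate the properly shifted partial product:

    # Loop-unfolding sketch of a 32-bit multiplication by byte slices.
    def byte(v, j):
        return (v >> (8 * j)) & 0xFF

    def stage(i, a, b, c):
        # One "hw_mul & combine" stage: 8x32 multiply plus shifted accumulate.
        return c + (byte(a, i) * b << (8 * i))

    def multiply(a, b):
        c = 0                        # init c
        for i in range(4):           # the 4 unfolded stages
            c = stage(i, a, b, c)
        return c

    assert multiply(123456, 789) == 123456 * 789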

Exercise

    • The cost model formula for D (given previously) is valid for a D-RISC-like CPU in which the integer multiplication/division operations are implemented entirely in hardware (1-2 clock cycles).

      Modify this formula for a Pipelined CPU architecture in which the integer multiplication/division operations are implemented as a 4-stage pipeline.

Pipelined floating point unit

Example: FP addition

FP number = (m, e), where: m = mantissa, e = exponent

N1 = (m1, e1), N2 = (m2, e2)

The addition is decomposed into successive steps, one per pipeline stage:

D = e2 − e1

m1' = m1 × 2^(−D), m2' = m2 (if D ≥ 0), or m1' = m1, m2' = m2 × 2^D (if D < 0)

m = m1' + m2'

e = max (e1, e2)

normalization
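
A behavioural sketch of these stages (Python; the normalization convention 0.5 ≤ |m| < 1 is our assumption):

    # Four-stage FP addition on (mantissa, exponent) pairs (sketch).
    def fp_add(n1, n2):
        (m1, e1), (m2, e2) = n1, n2
        d = e2 - e1                      # stage 1: exponent difference
        if d >= 0:                       # stage 2: align the smaller operand
            m1 *= 2.0 ** (-d)
        else:
            m2 *= 2.0 ** d
        m, e = m1 + m2, max(e1, e2)      # stage 3: mantissa addition
        while abs(m) >= 1.0:             # stage 4: normalization
            m, e = m / 2.0, e + 1
        while m != 0.0 and abs(m) < 0.5:
            m, e = m * 2.0, e - 1
        return m, e

    print(fp_add((0.75, 3), (0.5, 1)))   # (0.875, 3), i.e. 6 + 1 = 7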

Parallel EU schemes

    • Farm, where each worker is general-purpose, implementing addition/subtraction & multiply/divide on integer or floating point numbers.

    • Functional partitioning, where each worker implements a specific operation on integer or floating point numbers:

[Figure: a dispatcher feeds two INT Add/Sub units, an INT pipelined Mul/Div unit, an FP pipelined Add/Sub unit and an FP pipelined Mul/Div unit, with the Floating Point Registers; a collector gathers the results]

Parallel EU: additional logical dependencies

    • More arithmetic instructions can be executed simultaneously in a parallel EU.

    • This introduces additional logical dependencies "EU-EU" (till now: "IU-EU" only). For example, in the sequence:

      MUL Ra, Rb, Rc
      ADD Rd, Re, Rf

      the second instruction starts in EU as soon as the first enters the pipelined multiplier, while in the sequence:

      MUL Ra, Rb, Rc
      ADD Rc, Re, Rf

      the second instruction is blocked in EU until the first is completed.

      The dispatcher unit (see the previous slide) is in charge of implementing the synchronization mechanism, according to a semaphoric technique similar to the one used for synchronizing IU and EU.

    • Because of the EU latency and of the EU-EU dependencies, it is more convenient to increase the asynchrony degree of the IU-EU channel, i.e. a Queueing Unit is inserted. In practice, this increases the number of pipeline stages of the machine.

Data parallelism for arithmetic operations

    • Data parallelism is meaningful on large data structures, i.e. arrays, only

    • Vectorized assembler instructions operate on arrays, e.g.

      SUM_VECT base_address A, base_address B, base_address C, size N

    • Data parallel implementation: map at the firmware level

      • scatter arrays A, B

      • map execution

      • gather to obtain array C

    • Also: reduce, parallel prefix operations

    • Also: operations requiring stencil-based data parallel implementation, e.g. matrix-vector or matrix-matrix product, convolution, filters, Fast Fourier Transform.

Pure RISC enriched by floating point pipelined and vectorized co-processors

    • Mathematical co-processors, connected as I/O units

      • instead of assembler instructions

    • FP / vector operations called as libraries containing I/O transfer operations (LOAD, STORE - Memory Mapped I/O - DMA)

[Figure: the processor, the Memory Management Unit, the cache memory and the main memory on the DMA bus; the co-processors I/O1 … I/Ok sit on the I/O bus, together with the interrupt arbiter]

Solutions of Exercises

    • Exercise 1

    • Exercise 3

    • Exercise 4

Exercise 1

Compile, with optimizations, and evaluate the completion time of a program that computes the matrix-vector product (operands: int A[M][M], int B[M]; result: int C[M], where M = 10^4) on a D-RISC Pipeline machine.

The evaluation must include the cache fault effects, assuming a Data Cache of 64K words capacity, 8-word blocks, operating on-demand.

Solution of Exercise 1

Sequential algorithm, O(M^2):

int A[M][M]; int B[M]; int C[M];
for (i = 0; i < M; i++) {
  C[i] = 0;
  for (j = 0; j < M; j++)
    C[i] = C[i] + A[i][j] * B[j];
}

Process virtual memory:

    • matrix A stored by row, M^2 consecutive words;

    • compilation rule for 2-dimension matrices used in loops: for each iteration,

      base address = base address + M

Solution of Exercise 1

Basic compilation (without optimizations):

initialize RA, RB, RC, RM, Ri (= 0); allocate Rj, Ra, Rb, Rc

LOOP_i: CLEAR Rc
        CLEAR Rj
LOOP_j: LOAD RA, Rj, Ra
        LOAD RB, Rj, Rb
        MUL Ra, Rb, Ra
        ADD Rc, Ra, Rc
        INCR Rj
        IF < Rj, RM, LOOP_j
        STORE RC, Ri, Rc
        ADD RA, RM, RA
        INCR Ri
        IF < Ri, RM, LOOP_i
        END

Completion time:

Tc ≈ M^2 × (completion time of the innermost loop)

Tc ≈ 6 M^2 T

where T = service time per instruction.

Performance degradations in the outermost loop are not relevant (however, the compiler applies optimizations to this part too).

Solution of Exercise 1

Analysis of the innermost loop (initially: perfect cache):

l = 1/6; k = 1, dk = 1/6, NQk = 2

T = (1 + l) t + dk t (NQk + 1 − k) = 9t/6 ; e = 6/9

Tc ≈ 6 M^2 T = 9 M^2 t = 18 M^2 τ

LOOP_j: LOAD RA, Rj, Ra
        LOAD RB, Rj, Rb
        MUL Ra, Rb, Ra
        ADD Rc, Ra, Rc
        INCR Rj
        IF < Rj, RM, LOOP_j

Solution of Exercise 1

Optimization and evaluation of the innermost loop (perfect cache):

l = 0; k = 3 → D = 0

T = t = Tid ; e = 1

Tc ≈ 6 M^2 T = 6 M^2 t = 12 M^2 τ

Completion time difference = 6 M^2 τ, ratio = 1.5

        LOAD RA, Rj, Ra
LOOP_j: LOAD RB, Rj, Rb
        INCR Rj
        MUL Ra, Rb, Ra
        ADD Rc, Ra, Rc
        IF < Rj, RM, LOOP_j, delayed_branch
        LOAD RA, Rj, Ra

Solution of Exercise 1

Impact of caching:

Instruction faults: not relevant; reuse; 2 fixed blocks in the working set.

Matrix A, array C: locality only; 1 block at a time for A and for C in the working set.

Nfault-A = M^2/s, Nfault-C = M/s

Array B: reuse (+ locality); all B blocks in the working set; due to its size, B can be allocated entirely in cache:

LOAD RB, Rj, Rb, no_deallocation

Nfault-B = M/s

Assuming a secondary cache on chip: Tblock = s τ

Tfault = Nfault Tblock ≈ Nfault-A Tblock = M^2 τ

Worst-case assumption (the pipeline is blocked when a block transfer occurs):

Tc = Tc-perfect-cache + Tfault = 12 M^2 τ + M^2 τ = 13 M^2 τ ; ecache = 12/13

In this case, if prefetching were applicable (LOAD RA, Rj, Ra, prefetching), Nfault-A = 0 and Tc = Tc-perfect-cache, ecache = 1.

Solution of Exercise 1

The best-case assumption about block transfers corresponds to a more realistic behaviour: the pipeline works normally during a block transfer; only if a data dependency on the block value occurs, a bubble is naturally introduced.

In our case, during an A-block transfer:

    • DM serves other requests in parallel (B is in cache),

    • LOAD RB, … can be correctly executed out-of-order,

    • INCR Rj can be executed,

    • MUL can start; EU waits for the element of A.

With a secondary cache on chip, and s = 8, under these assumptions the block transfer is overlapped to the pipeline behaviour:

    • when EU is ready to execute MUL, the A-block transfer has been completed in DM,

    • no (relevant) cache degradation, even without relying on prefetching of A blocks.

For out-of-order behaviour, see the slides after Branch Prediction.

        LOAD RA, Rj, Ra
LOOP_j: LOAD RB, Rj, Rb
        INCR Rj
        MUL Ra, Rb, Ra
        ADD Rc, Ra, Rc
        IF < Rj, RM, LOOP_j, delayed_branch
        LOAD RA, Rj, Ra

Exercise 3

Systems S1 and S2 have a Pipelined CPU architecture. S1 is a D-RISC machine. The assembler level of S2 is D-RISC enriched with the following instruction:

REDUCE_SUM RA, RN, Rx

with RG[RA] = base address of an integer array A[N], and RG[RN] = N.

The semantics of this instruction is

RG[Rx] = reduce (A[N], +)

Remember that

reduce (H[N], ⊕)

where ⊕ is any associative operator, is a second-order function whose result is equal to the scalar value

H[0] ⊕ H[1] ⊕ … ⊕ H[N-1]

    • Explain how a reduce (A[N], +) is compiled on S1 and how it is implemented on S2, and evaluate the difference d in the completion times of reduce on S1 and on S2.

    • A certain program includes reduce (A[N], +) executed one and only one time. Explain why the difference of the completion times of this program on S1 and on S2 could be different compared to d as evaluated in point a).

Solution of Exercise 3

a) System S1. The sequential algorithm for

x = reduce (A[N], +)

having complexity O(N), can be compiled into the following optimized code:

      LOAD RA, Ri, Ra
LOOP: INCR Ri
      ADD Rx, Ra, Rx
      IF < Ri, RN, LOOP, delayed_branch
      LOAD RA, Ri, Ra

Performance analysis, without considering possible cache degradations:

l = 0; k = 2, dk = 1/4, NQk = 2

T = (1 + l) t + dk t (NQk + 1 − k) = 5t/4 ; e = 4/5

Tc1 ≈ 4 N T = 5 N t = 10 N τ

Solution of Exercise 3

System S2. The firmware interpreter of the instruction REDUCE_SUM RA, RN, Rx on a pipelined CPU can be the following:

IM microprogram: the same as S1.

IU microprogram for REDUCE: send the instruction encoding to EU, produce a stream of N read-requests to DM with consecutive addresses beginning at RG[RA].

DM microprogram: the same as S1.

EU microprogram for REDUCE: initialize RG[Rx] at zero; loop for N times: wait for a datum d from DM, RG[Rx] + d → RG[Rx]; on completion, send the RG[Rx] content to IU.

Alternatively: IU sends to DM just one request (multiple_read, N, base address RG[RA]), and DM produces a stream of consecutive-address data to EU (there is no advantage).
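
A behavioural sketch of this interpreter (Python; the message-passing between DM and EU is abstracted into a plain loop):

    # REDUCE_SUM on S2: DM streams A[0..N-1] to EU, which accumulates.
    def reduce_sum(A):
        rx = 0                    # EU: initialize RG[Rx] at zero
        for d in A:               # stream of N values read from DM
            rx = rx + d           # RG[Rx] + d -> RG[Rx]
        return rx                 # on completion, sent to IU

    assert reduce_sum(list(range(10))) == 45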

Solution of Exercise 3

Considering the graphical simulation of this S2 interpreter, we obtain the latency of the REDUCE instruction on S2, which is the completion time of an S2 program consisting of REDUCE only (see the general formula for the completion time of the pipeline paradigm):

Tc2 ≈ N t = 2 N τ

Notice that there are no logical dependency nor branch degradations.

The completion time ratio of the two implementations on S1 and S2 is equal to 5, and the difference is

d = 8 N τ

This example confirms the analysis of the comparison of the assembler vs the firmware implementation of the same functionality (see the Exercise in the prerequisites): the difference in latency is mainly due to the parallelism in microinstructions wrt the sequential execution at the assembler level.

In the example, this parallelism is relevant in the IU and EU behaviour.

Moreover, the degradations are often reduced, because some computations occur locally in some units (EU in our case) instead of requiring the interaction of distinct units, with consistency synchronization problems (in our case, only the final result is communicated from EU to IU).

Solution of Exercise 3

b) Consider an S1 program containing the REDUCE routine and executing it just one time. The equivalent S2 program contains the REDUCE instruction in place of the REDUCE routine.

The difference of the completion times is less than or equal to the d value of part a):

    • In S1, the compiler has more degrees of freedom to introduce optimizations concerning the code sections before and after the REDUCE routine:

      • the REDUCE loop can be, at least partially, unfolded,

      • the compiler can try to introduce some code motion, in order to mix some instructions of the code sections before and after the REDUCE routine with the instructions of the REDUCE routine itself.

This example shows the potential for compiler optimizations of RISC systems compared to CISC: in some cases, such optimizations are able to compensate the RISC latency disadvantage, at least partially.

Exercise 4

The cost model formula for D (given previously) is valid for a D-RISC-like CPU in which the integer multiplication/division operations are implemented entirely in hardware (1-2 clock cycles).

Modify this formula for a Pipelined CPU architecture in which the integer multiplication/division operations are implemented as a 4-stage pipeline.

Solution of Exercise 4

Cost model for the pipelined D-RISC CPU:

D = Σk dk (NQk + 1 − k) t

    • where

    • NQ = 2 if the instruction inducing the logical dependency is the last instruction of a sequence containing all EU-executable instructions and at least one LOAD,

    • otherwise NQ = 1

    • the summation is extended to non-negative terms only. In D-RISC, for k ≥ 3, D = 0.

    • The meaning is that each Dk can be expressed as

    • Dk = dk RQk, with RQk = WQk + LS

    • WQk = (NQk − k) t = average number of queued instructions that (with probability dk) precede the instruction inducing the logical dependency

    • LS = latency of EU. With hardware-implemented arithmetic operations on integers:

    • LS = t

Solution of Exercise 4

A 4-stage pipelined MUL/DIV operator has latency:

Ls-MUL/DIV = 4 t

If p = prob(instruction is MUL or DIV), and all the other arithmetic operations are "short", the average EU latency is:

Ls = (1 + 3p) t

Then (the proof can be verified using the graphical simulation):

D = Σk dk (NQk + 3p + 1 − k) t

This result can be generalized to any value of the EU latency, for D-RISC-like machines. Moreover, it is valid also for EU-EU dependencies.

Branch prediction

    • For high-latency pipelined EU implementations, the effect of the logical dependencies degradation is significant.

    • When logical dependencies affect the predicate evaluation in branch instructions, e.g.

      MUL Ra, Rb, Rc
      IF = Rc, Rd, CONT
      …..
      ….. false predicate path
      …..
      CONT: …..
      ….. true predicate path
      …..

      the so-called Branch Prediction technique can be applied.

Branch prediction

    • Try to continue along the path corresponding to the false predicate (no branch)

      • Compile the program in such a way that the false predicate corresponds to the most probable event, if known or predictable

    • On-condition execution of this instruction stream

      • Save the RG state during this on-condition phase

      • Additional unit for RG copies; unique identifiers (IC) associated

    • When the values for the predicate evaluation are updated, verify the correctness of the on-condition execution and, if the prediction was incorrect, apply a recovery action using the saved RG state.

Branch prediction

[Figure: IU and EU share a General Register Unit with a Reordering Buffer, holding several versions of the General Register values with associated unique identifiers (e.g. 512 – 1024 registers)]

    • Example of out-of-order execution.

    • From a conceptual viewpoint, the problem is similar to the implementation of an "ordering collector" in a farm structure.

    • Complexity, chip area and power consumption are increased by similar techniques.

    • It is not necessarily a cost-effective technique.

Out-of-order behaviour

    • Branch prediction is a typical case of out-of-order behaviour.

    • Other out-of-order situations can be recognized: for example, see the Solution of Exercise 1 about the best-case assumption for the cache block transfer evaluation:

              LOAD RA, Rj, Ra
      LOOP_j: LOAD RB, Rj, Rb
              INCR Rj
              MUL Ra, Rb, Ra
              ADD Rc, Ra, Rc
              IF < Rj, RM, LOOP_j, delayed_branch
              LOAD RA, Rj, Ra

    • The best-case behaviour is implementable provided that LOAD RB, … can be executed during the cache fault handling caused by LOAD RA, …. That is, EU must be able to "skip" the instruction LOAD RA, … and to execute LOAD RB, …, postponing the execution of LOAD RA, … until the source operand is sent by DM (the LOAD RA, … instruction has been received by EU, but it has been saved in an EU internal buffer).

    • This can be done:

    • by a proper instruction ordering in the program code (e.g. the two LOADs can be exchanged), or

    • by compiler-provided annotations in the instructions (if this facility exists in the definition of the assembler machine), or

    • by some rules in the firmware interpreter of EU: e.g., the execution order of a sequence of LOAD instructions received by EU can be altered (no logical dependencies can exist between them if they have been sent to EU).



    Superscalar architectures: a first broad outline

    Basic case: 2-issue superscalar

    • In our Pipelined CPU architecture, each element of the instruction stream generated by IM consists of 2 consecutive instructions

      • Instruction cache is 2-way interleaved or each cell contains 2 words (“Long Word”)

    • IU prepares both instructions in parallel (during at most 2 clock cycles) and delivers them to DM/EU, or executes branches, provided that they are executable (valid instructions and no logical dependency)

      • It is also possible that the first instruction induces a logical dependency on the second one, or that it is a branch

    • DM and EU can be realized with sufficient bandwidth to execute 2 instructions in 2 clock cycles

    • Performance achievement: the communication latency (2t) is fully masked:

      Pid = 2/Tid = 1/t = fclock instructions/sec

      • e.g. fclock = 4 GHz, Pid = 4 GIPS

        though performance degradations are greater than in the basic Pipelined CPU (e is lower), unless more powerful optimizations are introduced

      • branch prediction, out-of-order execution
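    A quick numeric check of the formula above (assuming, as stated, the ideal service time for a long word of 2 instructions is Tid = 2t):

        fclock = 4e9               # 4 GHz
        t = 1 / fclock             # clock cycle
        T_id = 2 * t               # 2 instructions delivered every 2 clock cycles
        P_id = 2 / T_id            # = 1/t = fclock
        print(P_id / 1e9, "GIPS")  # 4.0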



    Superscalar CPUs

    • In general, n-issue superscalar architectures can be realized

      • n = 2 – 4 – 8 – 16

    [Figure: n-issue superscalar CPU (n = 4) — a Long Word containing instruction 0 … instruction 3; IU with Reordering Buffer and Dynamic Register Allocation; Modular Data Cache; parallel–pipelined EU with Fixed and Float functional units and Float Registers.]



    Superscalar and VLIW

    • All the performance degradation problems are increased

    • The Long Word processing and the out-of-order control introduce serious problems of power consumption

      • unless some features are sacrificed,

      • e.g. less cache capacity (or no cache at all!) in order to find space for the Reordering Buffer and for large intra-chip links

    • This is one of the main reasons for the “Moore Law” crisis

    • Trend: multicore chips, where each core is a basic, in-order Pipelined CPU, or a 2-issue in-order superscalar CPU.

    • VLIW (Very Long Instruction Word) architectures: the compiler ensures that all the instructions belonging to a (Very) Long Word are independent (and static branch prediction techniques are applied); see the sketch below

      • achievement: reduced complexity of the firmware machine.
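    A minimal sketch of the compiler-side independence check (hypothetical representation: each instruction is reduced to its sets of read and written registers, in the style of Bernstein conditions):

        # Two instructions may share a long word only if there is no
        # read-after-write, write-after-read or write-after-write conflict.
        def independent(i1, i2) -> bool:
            r1, w1 = i1
            r2, w2 = i2
            return not (w1 & r2 or w2 & r1 or w1 & w2)

        mul = ({"Ra", "Rb"}, {"Rc"})   # MUL Ra, Rb, Rc
        add = ({"Rd", "Re"}, {"Rf"})   # ADD Rd, Re, Rf
        use = ({"Rc", "Rd"}, {"Rg"})   # ADD Rc, Rd, Rg (reads MUL's result)

        print(independent(mul, add))   # True:  can be packed together
        print(independent(mul, use))   # False: logical dependency on Rc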



    A general architectural model for Instruction Level Parallelism

    • The subject of superscalar architectures will not be studied in more depth

      • further elements will be introduced when needed

      • in a subsequent part of the course, multithreading architectures will be studied

    • However, it is important to understand the general conceptual framework of Instruction Level Parallelism:

      • the data-flow computational model and data-flow machines

        (all the superscalar/multithreading architectures apply this model, at least in part)

        (the commercial flop of some interesting architectures – including VLIW and pure data-flow – remains an ICT mystery …)



    Data-flow computational model

    • The executable version (assembler level) of a program is represented as a graph (called data-flow graph), and not as a linear sequence of instructions

      • Nodes correspond to instructions

      • Arcs correspond to channels of values

        flowing between instructions

    • Example:

      (x + y) * sqrt(x + y) + y – z

      • application of Bernstein conditions

    [Figure: data-flow graph of the expression above — data tokens flow along the arcs between operator nodes.]

    An instruction is enabled, and can be executed, if and only if its input values (“tokens”) are ready.

    Once executed, the input tokens are removed, and the result token is generated and sent to other instructions.
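    A minimal Python sketch of this firing rule, applied to the subexpression (x + y) * sqrt(x + y) of the example above (hypothetical code: Node, put and fire are illustrative names, not part of any real data-flow machine):

        import math

        class Node:
            def __init__(self, op, n_inputs, dests):
                self.op, self.dests = op, dests
                self.inputs = [None] * n_inputs          # one slot per input channel

            def put(self, i, value, ready):
                self.inputs[i] = value
                if all(v is not None for v in self.inputs):   # enabled: all tokens ready
                    ready.append(self)

            def fire(self, ready):
                result = self.op(*self.inputs)
                self.inputs = [None] * len(self.inputs)  # input tokens are consumed
                for node, i in self.dests:
                    node.put(i, result, ready)           # result token sent onwards
                return result

        # (x + y) * sqrt(x + y): '+' feeds both the sqrt node and the '*' node
        mul  = Node(lambda a, b: a * b, 2, [])
        sqrt = Node(math.sqrt, 1, [(mul, 1)])
        add  = Node(lambda a, b: a + b, 2, [(sqrt, 0), (mul, 0)])

        ready = []
        add.put(0, 3.0, ready)   # token x = 3
        add.put(1, 1.0, ready)   # token y = 1
        while ready:
            result = ready.pop(0).fire(ready)
        print(result)            # (3 + 1) * sqrt(3 + 1) = 8.0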



    Data-flow computational model

    • Purely functional model of computation at the assembler level

      • a non-Von Neumann machine.

    • No variables (conceptually: no memory)

    • No instruction counter

    • Logical dependencies are not viewed as performance degradation sources; on the contrary, they are the only mechanism for instruction ordering

      • only the strictly necessary logical dependencies are present in a data-flow program,

      • consistency problems and out-of-order execution issues don't exist, or are solved implicitly and automatically, without additional mechanisms except communication.

    • Moreover, the process/thread concept disappears: instructions belonging to distinct programs can be executed simultaneously

      • a data-flow instruction is the unit of parallelism (i.e., it is a process)

      • this concept is the basic mechanism of multithreading/hyperthreading machines.



    Data-flow graphs

    Conditional expression:

    if p(x) then f(x) else g(x)

    Iterative expression:

    while p(x) do new X = f(X)

    [Figure: data-flow graphs of the conditional and iterative expressions, built with Switch and Merge operators driven by a control token.]

    Any data-flow program (data-flow graph) is compiled as the composition of arithmetic, conditional and iterative expressions (data-flow subgraphs), as well as of recursive functions (not shown).
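    A minimal Python sketch of the Switch and Merge operators, driving the iterative expression by hand (hypothetical code; p(x) = x < 10 and f(x) = 2x are arbitrary examples):

        def switch(control: bool, token):
            """Route the token to the (true branch, false branch) outputs."""
            return (token, None) if control else (None, token)

        def merge(control: bool, true_token, false_token):
            """Select the token of the input indicated by the control token."""
            return true_token if control else false_token

        # while p(x) do new X = f(X), with p(x) = x < 10 and f(x) = 2x
        token, from_loop = 1, False
        while True:
            x = merge(not from_loop, 1, token)   # initial token vs. feedback token
            ctrl = x < 10                        # control token = p(x)
            loop_tok, exit_tok = switch(ctrl, x)
            if exit_tok is not None:
                print(exit_tok)                  # 16: first value with not p(x)
                break
            token, from_loop = 2 * loop_tok, True   # new X = f(X)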



    Basic data-flow architecture

    Data-flow instructions are encoded as information packets corresponding to segments of the data-flow graph:

    [Figure: instruction packet layout — each input channel carries a gating code, a gating flag, a data-ready flag and a data value; a True/False gate is driven by a control token; the packet ends with the references to the result destination channels.]

    Analogy: a (fine-grain) process in the message-passing model.
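    A minimal sketch of such a packet as a data structure (hypothetical field names, mirroring the fields named in the figure):

        from dataclasses import dataclass, field
        from typing import Any, List, Tuple

        @dataclass
        class InputChannel:
            gating_code: str = "none"   # e.g. "none", "true_gate", "false_gate"
            gating_flag: bool = False   # control token received (for gated channels)
            data_ready: bool = False    # data token present?
            data_value: Any = None

        @dataclass
        class InstructionPacket:
            opcode: str
            inputs: List[InputChannel] = field(default_factory=list)
            dests: List[Tuple[int, int]] = field(default_factory=list)  # (instruction id, channel)

            def enabled(self) -> bool:  # firing rule: all input tokens ready
                return all(c.data_ready for c in self.inputs)

        # e.g. a two-input ADD whose result feeds channel 0 of instruction 7
        add = InstructionPacket("ADD", [InputChannel(), InputChannel()], [(7, 0)])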



    Basic data-flow architecture

    Instruction Memory: stores a result value into an input channel of the destination instruction, verifies the instruction enabling (do all input channels of this instruction contain data?) and, if enabled, sends the instruction to the execution farm.

    Notes:

    in this model, a memory is just an implementation of queueing and communication mechanisms;

    instructions of more than one program can be in execution simultaneously.

    [Figure: basic data-flow architecture — the Instruction Memory feeds a firmware farm, composed of an emitter structure, a pool of (pipelined) Functional Units, and a collection structure.]


