
Computer Architecture Principles
Dr. Mike Frank

CDA 5155 (UF) / CA 714-R (NTU), Summer 2003

Module #34

Introduction to Multiprocessing


H&P Chapter 6 - Multiprocessing

Introduction

Application Domains

Symmetric Shared Memory Architectures

Their performance

Distributed Shared Memory Architectures

Their performance

Synchronization

Memory consistency

Multithreading

Crosscutting Issues

Example: Sun Wildfire

Multithreading example

Embedded multiprocs.

Fallacies & Pitfalls

Concluding remarks

Historical perspective


But, I will begin with some of my own material on cost-efficiency and scalability of physically realistic parallel architectures.


Capacity Scaling – Some History

  • How can we increase the size & complexity of computations that can be performed?

    • Quantified as number of bits of memory required

  • Capacity scaling models:

    • Finite State Machines (a.k.a. Discrete Finite Automata):

      • Increase bits of state → Exponential increase in:

        • number of states & transitions, size of state-transition table

      • Infeasible to scale to large # bits – complex design, unphysical

    • Uniprocessor (serial) models:

      • Turing machine, von Neumann machine (RAM machine) (1940’s)

      • Leave processor complexity constant…

        • Just add more memory!

      • But, this is not a cost-effective way to scale capacity!

    • Multiprocessor models:

      • Von Neumann’s Cellular Automaton (CA) models (1950’s)

      • Keep individual processors simple, just have more of them

        • Design complexity stays manageable

      • Scale amount of processing & memory together
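To make the FSM scaling problem concrete, here is a small illustrative calculation (mine, not from the slides) of how an explicit state-transition table blows up with the number of state bits; the 1-bit input alphabet is an assumption for the example.

```python
# Illustrative only: explicit state-transition table size for an FSM
# holding n bits of state and reading a 1-bit input symbol.
for n in (8, 16, 32, 64):
    states = 2 ** n               # number of distinct states
    table_entries = states * 2    # one next-state entry per (state, input) pair
    print(f"{n:2d} state bits -> {states:.3e} states, {table_entries:.3e} table entries")
```

Already at 64 bits of state the table would need roughly 3.7e19 entries, which is why the FSM model is infeasible to scale.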



Why Multiprocessing?

  • A pretty obvious idea:

    • Any given serial processor has a maximum speed:

      • Operations/second ≤ X.

    • Therefore, N such processors together will have a larger total max raw performance than this:

      • namely N·X operations per second

    • If a computational task can be divided among these N processors, we may reduce its execution time by some speedup factor ≤ N.

      • Usually the speedup is at least slightly less than N, due to overheads.

      • Exact factor depends on the nature of the application.

      • In extreme cases, speedup factor may be much less than N, or even 1 (no speedup)
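As a toy illustration of why the realized speedup usually falls below N, here is a small model of my own with an assumed fixed coordination overhead per processor; the numbers (T_serial, overhead_per_proc) are made up for the example.

```python
# Toy speedup model (assumed numbers): serial work divided across N processors,
# plus an assumed coordination overhead that grows with N.
T_serial = 1000.0           # seconds for the serial run (assumed)
overhead_per_proc = 0.5     # seconds of coordination cost per processor (assumed)

for N in (1, 4, 16, 64, 256):
    T_parallel = T_serial / N + overhead_per_proc * N
    speedup = T_serial / T_parallel
    print(f"N={N:4d}  speedup={speedup:7.1f}  (ideal: {N})")
```

With these assumed numbers the speedup peaks well below N for large N, matching the point that in extreme cases the factor can be much less than N.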


Multiprocessing & Cost-Efficiency

  • For a given application, which is more cost-effective, a uniprocessor or a multiprocessor?

    • N-processor system cost, 1st-order approximation: C_N ≈ C_fixed + N·C_proc

    • N-processor execution time: T_N ≈ T_ser / (N·eff_N)

    • Focus on overall cost-per-performance (est.): (C/P) ≈ C_N · T_N

Measures cost to rent a machine for the job (assuming fixed depreciation lifetime).


Cost-Efficiency Cont.

  • Uniprocessor cost/performance:

    (C/P)_uni = (C_fixed + C_proc) · T_ser

  • N-way multiprocessor cost/performance:

    (C/P)_N = (C_fixed + N·C_proc) · T_ser / (N·eff_N)

  • The multiprocessor wins if and only if:
    (C/P)_N < (C/P)_uni  ⟺  eff_N > (1/N + r) / (1 + r)  ⟺  r < (eff_N − 1/N) / (1 − eff_N),
    where r = C_proc / C_fixed.  Pick N to maximize (1 + r)·N·eff_N / (1 + N·r).
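A minimal numeric sketch of this comparison, using the two (C/P) formulas above. The dollar costs, the serial time, and the efficiency curve eff(N) are all assumptions chosen only to make the break-even behavior visible.

```python
# Cost/performance comparison using the two formulas above, with illustrative numbers.
C_fixed = 500.0     # $ fixed system cost (assumed)
C_proc  = 200.0     # $ per processor (assumed)
T_ser   = 3600.0    # serial execution time in seconds (assumed)
r = C_proc / C_fixed

def eff(N):
    # Assumed parallel-efficiency curve: efficiency decays as N grows.
    return 1.0 / (1.0 + 0.02 * (N - 1))

cp_uni = (C_fixed + C_proc) * T_ser
for N in (1, 2, 8, 32, 128):
    cp_N = (C_fixed + N * C_proc) * T_ser / (N * eff(N))
    wins = cp_N < cp_uni
    wins_check = eff(N) > (1.0 / N + r) / (1.0 + r)    # slide's equivalent condition
    print(f"N={N:4d}  (C/P)_N={cp_N:12.1f}  multiprocessor wins: {wins} / {wins_check}")
```

The last column checks the equivalent win condition eff_N > (1/N + r)/(1 + r); it agrees with the direct (C/P) comparison.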


Parallelizability

  • An application or algorithm is parallelizable to the extent that adding more processors can reduce its execution time.

  • A parallelizable application is:

    • Communication-intensive if its performance is primarily limited by communication latencies.

      • Requires a tightly-coupled parallel architecture.

        • Means, low communication latencies between CPUs

    • Computation-intensive if its performance is primarily limited by speeds of individual CPUs.

      • May use a loosely-coupled parallel architecture

      • Loose coupling may even help! (b/c of heat removal.)


Performance Models

  • For a given architecture, a performance model of the architecture is:

    • an abstract description of the architecture that allows one to predict what the execution time of given parallel programs will be on that architecture.

  • Naïve performance models might make dangerous simplifying assumptions, such as:

    • Any processor will be able to access shared memory at the maximum bandwidth at any time.

    • A message from any processor to any other will arrive within n seconds.

  • Watch out! Such assumptions may be flawed...
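To see how a naive performance model can mislead, here is a small contrast (my own, with assumed numbers) between a constant-latency assumption and a distance-aware model in which the worst-case hop count grows with machine size.

```python
import math

def naive_latency(n_nodes):
    # Naive model: any-to-any message arrives in a fixed time, regardless of n_nodes.
    return 100e-9                          # 100 ns, assumed constant

def mesh_latency(n_nodes, hop_delay=20e-9):
    # Distance-aware model: worst-case hop count across a 2-D mesh grows like sqrt(n).
    side = math.isqrt(n_nodes)
    return 2 * side * hop_delay            # corner-to-corner, assumed 20 ns per hop

for n in (64, 1024, 65536):
    print(f"n={n:6d}  naive = {naive_latency(n)*1e9:6.0f} ns   mesh = {mesh_latency(n)*1e9:8.0f} ns")
```

The constant-latency prediction never changes with n, while the mesh estimate grows by orders of magnitude; this is exactly the kind of breakdown to watch out for.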


Classifying Parallel Architectures

  • What’s parallelized? Instructions / data / both?

    • SISD: Single Instruction, Single Data (uniproc.)

    • SIMD: Single Instruction, Multiple Data (vector)

    • MIMD: Multiple Instruction, Multiple Data (multiproc.)

    • MISD: (Special purpose stream processors)

  • Memory access architectures:

    • Centralized shared memory (fig. 6.1)

      • Uniform Memory Access (UMA)

    • Distributed shared memory (fig. 6.2)

      • Non-Uniform Memory Access (NUMA)

    • Distributed, non-shared memory

      • Message Passing Machines / Multicomputers / Clusters


Centralized Shared Memory

A.k.a. symmetric multiprocessor.

A typical example architecture.

Typically, only 2 to a few dozen processors; after this, memory BW becomes very restrictive.


Distributed Shared Memory

Advantages: Memory BW scales w. #procs; local mem. latency kept small


DSM vs. Multicomputers

  • Distributed shared-memory architectures:

    • Although each processor is close to some memory,all processors still share the same address space.

      • Memory system responsible for maintaining consistency between each processor’s view of the address space.

  • Distributed non-shared memory architectures:

    • Each processor has its own address space.

      • Many independent computers → “multicomputer”

      • COTS computers, network → “cluster”

    • Processors communicate w. explicit messages

      • Can still layer shared object abstractions on top of this infrastructure via software.


Communications in Multiprocs.

  • Communications performance metrics:

    • Node bandwidth – bit-rate in/out of each proc.

    • Bisection bandwidth – bit-rate between machine halves

    • Latency – propagation delay across machine diameter

  • Tightly coupled (localized) vs. loosely coupled (distributed) multiprocessors:

    • Tightly coupled: High bisection BW, low latency

    • Loosely coupled: Low bisection BW, high latency

  • Of course, you can also have a loosely-coupled (wide-area) network of (internally) tightly-coupled clusters.
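A back-of-envelope sketch of these three metrics for a k × k 2-D mesh; the per-link bandwidth and per-hop delay are assumed values, and the bisection count uses the standard fact that k links cross the midline of a k × k grid.

```python
# Rough metrics for a k x k 2-D mesh, with assumed per-link parameters.
link_bw   = 10e9      # bits/s per link (assumed)
hop_delay = 20e-9     # seconds per hop (assumed)

def mesh_metrics(k):
    node_bw      = 4 * link_bw               # an interior node has 4 links
    bisection_bw = k * link_bw               # k links cross the machine's midline
    latency      = 2 * (k - 1) * hop_delay   # worst case: corner to opposite corner
    return node_bw, bisection_bw, latency

for k in (4, 16, 64):
    nbw, bbw, lat = mesh_metrics(k)
    print(f"{k*k:5d} nodes: node BW {nbw/1e9:.0f} Gb/s, "
          f"bisection BW {bbw/1e9:.0f} Gb/s, diameter latency {lat*1e6:.2f} us")
```

Note how node bandwidth stays constant while bisection bandwidth and diameter latency grow with k: that is the difference between tightly and loosely coupled regimes.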


Shared mem. vs. Message-Passing

  • Advantages of shared memory:

    • Straightforward, compatible interfaces (e.g., OpenMP).

    • Ease of applic. programming & compiler design.

    • Lower comm. overhead for small items

      • Due to HW support

    • Use automatic caching to reduce comm. needs

  • Advantages of message passing:

    • Hardware is simpler

    • Communication explicit → easier to understand

    • Forces programmer to think about comm. costs

      • Encourages improved design of parallel algorithms

    • Enables more efficient parallel algs. than automatic caching could ever provide



Scalability & Maximal Scalability

  • A multiprocessor architecture & accompanying performance model is scalable if:

    • it can be “scaled up” to arbitrarily large problem sizes, and/or arbitrarily large numbers of processors, without the predictions of the performance model breaking down.

  • An architecture (& model) is maximally scalable for a given problem if

    • it is scalable, and if no other scalable architecture can claim asymptotically superior performance on that problem

  • It is universally maximally scalable (UMS) if it is maximally scalable on all problems!

    • I will briefly mention some characteristics of architectures that are universally maximally scalable


Universal Maximum Scalability

  • Existence proof for universally maximally scalable (UMS) architectures:

    • Physics itself can be considered a universal maximally scalable “architecture” because any real computer is just a special case of a physical system.

      • So, obviously, no real class of computers can beat the performance of physical systems in general.

    • Unfortunately, physics doesn’t give us a very simple or convenient programming model.

      • Comprehensive expertise at “programming physics” means mastery of all physical engineering disciplines: chemical, electrical, mechanical, optical, etc.

    • We’d like an easier programming model than this!


Simpler UMS Architectures

  • (I propose) any practical UMS architecture will have the following features:

    • Processing elements characterized by constant parameters (independent of # of processors)

    • Mesh-type message-passing interconnection network, arbitrarily scalable in 2 dimensions

      • w. limited scalability in 3rd dimension.

    • Processing elements that can be operated in an arbitrarily reversible way, at least, up to a point.

      • Enables improved 3-d scalability in a limited regime

    • (In long term) Have capability for quantum-coherent operation, for extra perf. on some probs.


Shared Memory isn’t Scalable

  • Any implementation of shared memory requires communication between nodes.

  • As the # of nodes increases, we get:

    • Extra contention for any shared BW

    • Increased latency (inevitably).

  • Can hide communication delays to a limited extent, by latency hiding:

    • Find other work to do during the latency delay slot.

    • But the amount of “other work” available is limited by node storage capacity, parallelizability of the set of running applications, etc.


Global Unit-Time Message Passing Isn’t Scalable!

  • Naïve model: “Any node can pass a message to any other in a single constant-time interval”

    • independent of the total number of nodes

  • Has same scaling problems as shared memory

  • Even if we assume that BW contention (traffic) isn’t a problem, unit-time assumption is still a problem.

    • Not possible for all N, given speed-of-light limit!

    • Need cube root of N asymptotic time, at minimum.
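A quick back-of-envelope check of the cube-root claim, with an assumed 3-D packing density; only the speed-of-light bound and the geometry come from the argument above.

```python
# Speed-of-light lower bound on worst-case latency for N nodes
# packed into 3-D space at an assumed density.
c = 3.0e8               # m/s, speed of light
nodes_per_m3 = 1.0e9    # assumed packing density (one node per cubic millimeter)

for N in (1e6, 1e9, 1e12, 1e15):
    side = (N / nodes_per_m3) ** (1.0 / 3.0)   # edge of the cube holding N nodes
    t_min = side / c                           # one-way light-speed bound across it
    print(f"N={N:.0e}: cube edge {side:8.2f} m, latency >= {t_min*1e9:9.2f} ns")
```

Each 1000× increase in N multiplies the minimum worst-case latency by 10, i.e., it grows like N^(1/3).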


Many Interconnect Topologies Aren’t Scalable!

  • Suppose we don’t require that a node can talk to any other in 1 time unit, but only to selected others.

  • Some such schemes still have scalability problems, e.g.:

    • Hypercubes, fat hypercubes

    • Binary trees, fat-trees

    • Crossbars, butterfly networks

  • Any topology in which the number of unit-time hops to reach any of N nodes is of order less than N^(1/3) is necessarily doomed to failure!

See last year’s exams.


Only Meshes (or subgraphs of meshes) Are Scalable

  • 1-D meshes

    • Linear chain, ring, star (w. fixed # of arms)

  • 2-D meshes

    • Square grid, hex grid, cylinder, 2-sphere, 2-torus,…

  • 3-D meshes

    • Crystal-like lattices, w. various symmetries

    • Amorphous networks w. local interactions in 3d

    • An important caveat:

      • Scalability in 3rd dimension is limited by energy/information I/O considerations! More later…

(Vitányi, 1988)


Which Approach Will Win?

  • Perhaps, the best of all worlds?

  • Here’s one example of a near-future, parallel computing scenario that seems reasonably plausible:

    • SMP architectures within the smallest groups of processors on the same chip (chip multiprocessors), sharing a common bus and on-chip DRAM bank.

    • DSM architectures w. flexible topologies to interconnect larger (but still limited-size) groups of processors in a package-level or board-level network.

    • Message-passing w. mesh topologies for communication between different boards in a cluster-in-a-box (blade server), or higher-level conglomeration of machines.

But, what about the heat removal problem?


Landauer’s Principle

Famous IBM researcher’s 1961 paper

  • We know low-level physics is reversible:

    • Means, the time-evolution of a state is bijective

    • Change is deterministic looking backwards in time

      • as well as forwards

  • Physical information (like energy) is conserved

    • It cannot ever be created or destroyed,

      • only reversibly rearranged and transformed!

    • This explains the 2nd Law of Thermodynamics:

      • Entropy (unknown info.) in a closed, unmeasured system can only increase (as we lose track of its state)

  • Irreversible bit “erasure” really just moves the bit into surroundings, increasing entropy & heat


Illustrating Landauer’s Principle

Landauer’s Principle from basic quantum theory

[Figure: before bit erasure, the bit may be 0 or 1, with N possible states of the rest of the system in each case (s0 … sN−1 and s′0 … s′N−1, 2N states in all); unitary (1-1) evolution maps these onto 2N distinct states s″0 … s″2N−1 after erasure, all with the bit reset to 0, so the lost distinction ends up as entropy in the environment.]

Increase in entropy: ΔS = log 2 = k ln 2.  Energy lost to heat: ΔE = T·ΔS = kT ln 2.
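Plugging numbers into the kT ln 2 bound above, at roughly room temperature (the temperature and the 10^18 erasures/s rate are assumptions for illustration):

```python
# Landauer bound from the slide: minimum energy dissipated per irreversible bit erasure.
import math

k_B = 1.380649e-23     # J/K, Boltzmann constant
T   = 300.0            # K, roughly room temperature (assumed)

E_bit = k_B * T * math.log(2)
print(f"kT ln 2 at {T:.0f} K = {E_bit:.3e} J per bit erased")   # ~2.87e-21 J

# Illustrative rate: 1e18 irreversible bit erasures per second at this bound.
print(f"1e18 erasures/s -> {E_bit * 1e18 * 1e3:.2f} mW at the Landauer limit")
```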


Scaling in 3rd Dimension?

  • Computing based on ordinary irreversible bit operations only scales in 3d up to a point.

    • All discarded information & associated energy must be removed thru surface. But energy flux is limited!

    • Even a single layer of circuitry in a high-performance CPU can barely be kept cool today!

  • Computing with reversible, “adiabatic” operations does better:

    • Scales in 3d, up to a point…

    • Then with square root of further increases in thickness, up to a point. (Scales in 2.5 dimensions!)

    • Enables much larger thickness than irreversible!



Reversible 3-D Mesh

Note the differing power laws!


Cost-Efficiency of Reversibility

Scenario: $1,000, 100-Watt conventional computer, w. 3-year lifetime, vs. reversible computers of same storage capacity.

[Figure: bit-operations per US dollar over time for three cases: conventional irreversible computing, worst-case reversible computing (~1,000× better), and best-case reversible computing (~100,000× better). All curves would → 0 if leakage were not reduced.]


Example Parallel Applications

“Embarrassingly parallel”

  • Computation-intensive applications:

    • Factoring large numbers, cracking codes

    • Combinatorial search & optimization problems:

      • Find a proof of a theorem, or a solution to a puzzle

      • Find an optimal engineering design or data model, over a large space of possible design parameter settings

    • Solving a game-theory or decision-theory problem

    • Rendering an animated movie

  • Communication-intensive applications:

    • Physical simulations (sec. 6.2 has some examples)

      • Also multiplayer games, virtual work environments

    • File serving, transaction processing in distributed database systems



H&P Chapter 6 - Multiprocessing

Introduction

Application Domains

Symmetric Shared Memory Architectures

Their performance

Distributed Shared Memory Architectures

Their performance

Synchronization

Memory consistency

Multithreading

Crosscutting Issues

Example: Sun Wildfire

Multithreading example

Embedded multiprocs.

Fallacies & Pitfalls

Concluding remarks

Historical perspective



More about SMPs (6.3)

  • Caches help reduce each processor’s demand on memory bandwidth

    • Means many processors can share total memory BW

  • Microprocessor-based Symmetric MultiProcessors (SMPs) emerged in the 80’s

    • Very cost effective, up to limit of memory BW

  • Early SMPs had 1 CPU per board (off the backplane)

    • Now multiple per board, per MCM, or even per die

  • Memory system caches both shared and private (local) data

    • Private data in 1 cache only

    • Shared data may be replicated


Cache Coherence Problem

  • Goal: All processors should have a consistent view of the shared memory contents, and how they change.

    • Or, as nearly consistent as we can manage.

  • The fundamental difficulty:

    • Written information takes time to propagate!

      • E.g., A writes, then B writes, then A reads (like a WAW hazard)

        • A might see the value from A, instead of the value from B

  • A simple, but inefficient solution:

    • Have all writes cause all processors to stall (or at least, not perform any new accesses) until all have received the result of the write.

      • Reads, on the other hand, can be reordered amongst themselves.

    • But: Incurs a worst-case memory stall on each write step!

      • Can alleviate this by allowing writes to occur only periodically

        • But this reduces bandwidth for writes

        • And increases avg. latency for communication through shared memory


Another Interesting Method

Research by Chris Carothers at RPI

  • Maintain a consistent system “virtual time” modeled by all processors.

    • Each processor asynchronously tracks its local idea of the current virtual time. (Local Virtual Time)

  • On a write, asynchronously send invalidate messages timestamped with the writer’s LVT.

  • On receiving an invalidate message stamped earlier than the reader’s LVT,

    • Roll back the local state to that earlier time

      • There are efficient techniques for doing this

  • If timestamped later than the reader’s LVT,

    • Queue it up until the reader’s LVT reaches that time

(This is an example of speculation.)
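A toy sketch of the receive-side rule described above (roll back on invalidates stamped in the past, queue the rest). The data structures and names here are my own assumptions for illustration, not the actual implementation from the RPI work.

```python
import heapq

class Node:
    def __init__(self):
        self.lvt = 0.0        # Local Virtual Time
        self.pending = []     # min-heap of (timestamp, block) invalidates to apply later
        self.cache = {}       # block -> cached value

    def on_invalidate(self, timestamp, block):
        if timestamp < self.lvt:
            # Message is stamped in our past: roll back, then drop the stale copy.
            self.rollback(timestamp)
            self.cache.pop(block, None)
        else:
            # Stamped in our future: queue it until LVT reaches that time.
            heapq.heappush(self.pending, (timestamp, block))

    def advance_to(self, new_lvt):
        self.lvt = new_lvt
        while self.pending and self.pending[0][0] <= self.lvt:
            _, block = heapq.heappop(self.pending)
            self.cache.pop(block, None)

    def rollback(self, to_time):
        # Placeholder: restore state as of to_time (e.g., via checkpoints; next slide).
        self.lvt = to_time
```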


Frank-Lewis Rollback Method

Steve Lewis’ MS thesis, UF, 2001

(Reversible MIPS Emulator & Debugger)

  • Fixed-size window

    • Limits how far back you can go.

  • Periodically store checkpoints of machine state

    • Each checkpoint records changes needed

      • to get back to that earlier state from next checkpoint,

      • or from current state if it’s the last checkpoint

    • Cull out older checkpoints periodically

      • so the total number stays logarithmic in the size of the window.

  • Also, store messages received during time window

  • To go backwards Δt steps (to time t_old = t_cur − Δt),

    • Revert machine state to latest checkpoint preceding time t_old

      • Apply changes recorded in checkpoints from t_cur on backwards

    • Compute forwards from there to time t_old

  • Technique is fairly time- and space-efficient
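A minimal sketch of the bounded rollback window just described. For simplicity it stores full snapshots rather than the per-checkpoint change records used in the thesis, and the culling rule (spacing that roughly doubles with age) is an assumed way to keep the checkpoint count logarithmic in the window size.

```python
# Minimal sketch (assumed representation): a bounded rollback window that keeps
# periodic snapshots and culls older ones so their count stays roughly logarithmic.
class RollbackWindow:
    def __init__(self, window, interval):
        self.window = window        # how far back (in steps) rollback is allowed
        self.interval = interval    # steps between checkpoints
        self.checkpoints = []       # list of (time, snapshot), oldest first
        self.t = 0

    def step(self, state):
        # Advance one step; checkpoint the (dict) machine state periodically.
        self.t += 1
        if self.t % self.interval == 0:
            self.checkpoints.append((self.t, dict(state)))
            self._cull()

    def _cull(self):
        # Drop checkpoints outside the window, then thin the rest so the spacing
        # roughly doubles with age, keeping the count logarithmic in the window.
        self.checkpoints = [(t, s) for t, s in self.checkpoints
                            if t >= self.t - self.window]
        kept, gap = [], self.interval
        for t, s in reversed(self.checkpoints):      # newest first
            if not kept or kept[-1][0] - t >= gap:
                kept.append((t, s))
                gap *= 2
        self.checkpoints = list(reversed(kept))      # back to oldest-first order

    def rollback(self, t_old):
        # Return the latest snapshot at or before t_old; the caller then
        # re-executes forward from that checkpoint's time up to t_old.
        for t, snap in reversed(self.checkpoints):
            if t <= t_old:
                return t, dict(snap)
        raise ValueError("t_old is outside the rollback window")
```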


Definition of Coherence

  • A weaker condition than full consistency.

  • A memory system is called coherent if:

    • Reads return the most recent value written locally,

      • if no other processor wrote the location in the meantime.

    • A read can return the value written by another processor, if the times are far enough apart.

      • And, if nobody else wrote the location in between

    • Writes to any given location are serialized.

      • If A writes a location, then B writes the location, all processors first see the value written by A, then (later) the value written by B.

      • Avoids WAW hazards leaving cache in wrong state.


Cache Coherence Protocols

  • Two common types: (Differ in how they track blocks’ sharing state)

    • Directory-based:

      • sharing status of a block is kept in a centralized directory

    • Snooping (or “snoopy”):

      • Sharing status of each block is maintained (redundantly) locally by each cache

      • All caches monitor or snoop (eavesdrop) on the memory bus,

        • to notice events relevant to sharing status of blocks they have

  • Snooping tends to be more popular


Write Invalidate Protocols

  • When a processor wants to write to a block,

    • It first “grabs ownership” of that block,

    • By telling all other processors to invalidate their own local copy.

  • This ensures coherence, because

    • A block recently written is cached in 1 place only:

      • The cache of the processor that most recently wrote it

    • Anyone else who wants to write that block will first have to grab back the most recent copy.

      • The block is also written to memory at that time.

Analogous to using RCS to lock files


Meaning of Bus Messages

  • Write miss on block B:

    • “Hey, I want to write block B. Everyone, give me the most recent copy if you’re the one who has it. And everyone, also throw away your own copy.”

  • Read miss on block B:

    • “Hey, I want to read block B. Everyone, give me the most recent copy, if you have it. But you don’t have to throw away your own copy.”

  • Writeback of block B:

    • “Here is the most recent copy of block B, which I produced. I promise not to make any more changes until after I ask for ownership back and receive it.”
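To tie the three messages together, here is a toy snooping write-invalidate sketch of my own (not a full MSI/MESI protocol, and all class and method names are assumptions): each cache reacts to write-miss, read-miss, and writeback traffic as described above.

```python
# Toy write-invalidate snooping sketch (illustrative, not a full MSI/MESI protocol).
class Cache:
    def __init__(self, name, bus):
        self.name, self.bus, self.blocks = name, bus, {}   # block -> (value, owned)
        bus.caches.append(self)

    def write(self, block, value):
        self.bus.broadcast("write_miss", block, requester=self)   # grab ownership
        self.blocks[block] = (value, True)

    def read(self, block):
        if block not in self.blocks:
            self.bus.broadcast("read_miss", block, requester=self)
            self.blocks[block] = (self.bus.memory.get(block), False)
        return self.blocks[block][0]

    def snoop(self, kind, block, requester):
        if block not in self.blocks or requester is self:
            return
        value, owned = self.blocks[block]
        if owned:
            self.bus.memory[block] = value        # writeback of the most recent copy
        if kind == "write_miss":
            del self.blocks[block]                # invalidate the local copy
        elif kind == "read_miss" and owned:
            self.blocks[block] = (value, False)   # keep the copy, give up ownership

class Bus:
    def __init__(self):
        self.caches, self.memory = [], {}
    def broadcast(self, kind, block, requester):
        for c in self.caches:
            c.snoop(kind, block, requester)

bus = Bus()
A, B = Cache("A", bus), Cache("B", bus)
A.write("x", 1)            # A grabs ownership of x
print(B.read("x"))         # forces A's writeback; prints 1
B.write("x", 2)            # invalidates A's copy
print(A.read("x"))         # prints 2
```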




Write-Update Coherence Protocol

  • Also called write broadcast.

  • Strategy: Update all cached copies of a block when the block is written.

  • Comparison versus write-invalidate:

    • More bus traffic for multiple writes by 1 processor

    • Less latency for data to be passed between proc’s.

  • Bus & memory bandwidth is a key limiting factor!

    • Write-invalidate usually gives best overall perf.