
Computer Architecture Principles - Dr. Mike Frank

CDA 5155 (UF) / CA 714-R (NTU), Summer 2003

Module #34

Introduction to Multiprocessing

H&P Chapter 6 - Multiprocessing
  • Introduction
  • Application Domains
  • Symmetric Shared Memory Architectures
    • Their performance
  • Distributed Shared Memory Architectures
    • Their performance
  • Synchronization
  • Memory consistency
  • Multithreading
  • Crosscutting Issues
  • Example: Sun Wildfire
  • Multithreading example
  • Embedded multiprocs.
  • Fallacies & Pitfalls
  • Concluding remarks
  • Historical perspective

But, I will begin with some of my own material on cost-efficiency and scalability of physically realistic parallel architectures.

Capacity Scaling – Some History
  • How can we increase the size & complexity of computations that can be performed?
    • Quantified as number of bits of memory required
  • Capacity scaling models:
    • Finite State Machines (a.k.a. Discrete Finite Automata):
      • Increase bits of state → Exponential increase in:
        • number of states & transitions, size of state-transition table
      • Infeasible to scale to large # bits – complex design, unphysical
    • Uniprocessor (serial) models:
      • Turing machine, von Neumann machine (RAM machine) (1940’s)
      • Leave processor complexity constant…
        • Just add more memory!
      • But, this is not a cost-effective way to scale capacity!
    • Multiprocessor models:
      • Von Neumann’s Cellular Automaton (CA) models (1950’s)
      • Keep individual processors simple, just have more of them
        • Design complexity stays manageable
      • Scale amount of processing & memory together


Why Multiprocessing?
  • A pretty obvious idea:
    • Any given serial processor has a maximum speed:
      • Operations/second X.
    • Therefore, N such processors together will have a larger total max raw performance than this:
      • namely N·X operations per second
    • If a computational task can be divided among these N processors, we may reduce its execution time by some speedup factor ≤ N (see the sketch below).
      • Usually the factor is at least slightly less than N, due to overheads.
      • Exact factor depends on the nature of the application.
      • In extreme cases, speedup factor may be much less than N, or even 1 (no speedup)
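For intuition, here is a minimal Python sketch of this saturation (my own illustration, not from the slides), assuming the classic Amdahl's-law model in which a fixed fraction of the work stays serial; the parameter values are made up:

```python
# Minimal sketch (assumption: Amdahl's-law model, illustrative parameters only).
def speedup(n_procs, parallel_fraction):
    """Speedup of an N-processor run over a serial run, assuming a fraction
    of the work parallelizes perfectly and the rest stays serial."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_procs)

for n in (2, 8, 64, 1024):
    # Even with 95% parallelizable work, the speedup saturates well below N.
    print(n, round(speedup(n, 0.95), 2))
```
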
Multiprocessing & Cost-Efficiency
  • For a given application, which is more cost-effective, a uniprocessor or a multiprocessor?
    • N-processor system cost, 1st-order approximation: C_N ≈ C_fixed + N·C_proc
    • N-processor execution time: T_N ≈ T_ser / (N·eff_N), where eff_N is the parallel efficiency
    • Focus on overall cost-per-performance (est.): (C/P) = C_N · T_N

Measures the cost to rent a machine for the job (assuming a fixed depreciation lifetime).

Cost-Efficiency Cont.
  • Uniprocessor cost/performance:

(C/P)_uni = (C_fixed + C_proc) · T_ser

  • N-way multiprocessor cost/performance:

(C/P)_N = (C_fixed + N·C_proc) · T_ser / (N·eff_N)

  • The multiprocessor wins if and only if:
    • (C/P)_N < (C/P)_uni  ⇔  eff_N > (1/N + r) / (1 + r)  ⇔  r < (eff_N - 1/N) / (1 - eff_N), where r = C_proc / C_fixed.
    • Pick N to maximize (1 + r)·N·eff_N / (1 + N·r).
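A small numerical sketch of the comparison above; the cost, time, and efficiency figures plugged in are illustrative assumptions, not numbers from the lecture:

```python
# Minimal sketch of the cost/performance comparison above.
# The values of C_fixed, C_proc, T_ser and eff_N are illustrative assumptions.
def cp_uni(c_fixed, c_proc, t_ser):
    return (c_fixed + c_proc) * t_ser

def cp_multi(c_fixed, c_proc, t_ser, n, eff_n):
    return (c_fixed + n * c_proc) * t_ser / (n * eff_n)

c_fixed, c_proc, t_ser = 500.0, 200.0, 100.0   # dollars, dollars, hours (assumed)
r = c_proc / c_fixed

for n, eff_n in [(2, 0.9), (8, 0.7), (32, 0.5)]:
    direct = cp_multi(c_fixed, c_proc, t_ser, n, eff_n) < cp_uni(c_fixed, c_proc, t_ser)
    wins = eff_n > (1.0 / n + r) / (1.0 + r)     # break-even condition above
    gain = (1 + r) * n * eff_n / (1 + n * r)     # (C/P)_uni / (C/P)_N
    print(n, direct, wins, round(gain, 2))       # direct test and condition agree
```
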
Parallelizability
  • An application or algorithm is parallelizable to the extent that adding more processors can reduce its execution time.
  • A parallelizable application is:
    • Communication-intensive if its performance is primarily limited by communication latencies.
      • Requires a tightly-coupled parallel architecture.
        • Means, low communication latencies between CPUs
    • Computation-intensive if its performance is primarily limited by speeds of individual CPUs.
      • May use a loosely-coupled parallel architecture
      • Loose coupling may even help! (b/c of heat removal.)
Performance Models
  • For a given architecture, a performance model of the architecture is:
    • an abstract description of the architecture that allows one to predict what the execution time of given parallel programs will be on that architecture.
  • Naïve performance models might make dangerous simplifying assumptions, such as:
    • Any processor will be able to access shared memory at the maximum bandwidth at any time.
    • A message from any processor to any other will arrive within n seconds.
  • Watch out! Such assumptions may be flawed...
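One way to make the communication-cost assumption explicit and less naive (my own example, not a model from the slides) is a simple latency-plus-bandwidth estimate of message time; the parameter values below are assumed:

```python
# Minimal sketch (assumption): a latency + bandwidth message-cost model,
# t(msg) = alpha + bytes / beta, with alpha = per-message latency (s) and
# beta = sustained bandwidth (bytes/s). Parameter values are illustrative.
def message_time(n_bytes, alpha=5e-6, beta=1e9):
    return alpha + n_bytes / beta

# A naive "unit-time" model would charge the same cost to both of these:
print(message_time(8))           # small message: dominated by latency
print(message_time(8_000_000))   # large message: dominated by bandwidth
```
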
Classifying Parallel Architectures
  • What’s parallelized? Instructions / data / both?
    • SISD: Single Instruction, Single Data (uniproc.)
    • SIMD: Single Instruction, Multiple Data (vector)
    • MIMD: Multiple Instruction, Mult. Data (multprc.)
    • MISD: (Special purpose stream processors)
  • Memory access architectures:
    • Centralized shared memory (fig. 6.1)
      • Uniform Memory Access (UMA)
    • Distributed shared memory (fig. 6.2)
      • Non-Uniform Memory Access (NUMA)
    • Distributed, non-shared memory
      • Message Passing Machines / Multicomputers / Clusters
Centralized Shared Memory

A.k.a. symmetric multiprocessor.

A typical example architecture.

Typically, only 2 to a few dozen processors.

After this, memory BW becomes very restrictive.

Distributed Shared Memory

Advantages: Memory BW scales w. #procs; local mem. latency kept small

DSM vs. Multicomputers
  • Distributed shared-memory architectures:
    • Although each processor is close to some memory, all processors still share the same address space.
      • Memory system responsible for maintaining consistency between each processor’s view of the address space.
  • Distributed non-shared memory architectures:
    • Each processor has its own address space.
      • Many independent computers → “multicomputer”
      • COTS computers, network → “cluster”
    • Processors communicate w. explicit messages
      • Can still layer shared object abstractions on top of this infrastructure via software.
Communications in Multiprocs.
  • Communications performance metrics:
    • Node bandwidth – bit-rate in/out of each proc.
    • Bisection bandwidth – bit-rate between machine halves
    • Latency – propagation delay across mach. diameter
  • Tightly coupled (localized) vs. loosely coupled (distributed) multiprocessors:
    • Tightly coupled: High bisection BW, low latency
    • Loosely coupled: Low bisection BW, high latency
  • Of course, you can also have a loosely-coupled (wide-area) network of (internally) tightly-coupled clusters.
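As a rough illustration of how these metrics scale (my own sketch; the link bandwidth and per-hop delay are assumed values), consider an ideal square 2-D mesh:

```python
# Minimal sketch (assumption): metrics for an ideal sqrt(N) x sqrt(N) 2-D mesh
# with one link of bandwidth `link_bw` per edge and per-hop delay `hop_delay`.
import math

def mesh_metrics(n_nodes, link_bw=1e9, hop_delay=50e-9):
    side = int(math.isqrt(n_nodes))      # assume N is a perfect square
    node_bw = 4 * link_bw                # up to 4 links per interior node
    bisection_bw = side * link_bw        # links cut when halving the mesh
    diameter_latency = 2 * (side - 1) * hop_delay
    return node_bw, bisection_bw, diameter_latency

print(mesh_metrics(1024))
print(mesh_metrics(65536))   # bisection BW grows only as sqrt(N)
```
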
Shared mem. vs. Message-Passing
  • Advantages of shared memory:
    • Straightforward, compatible interfaces - OpenMP.
    • Ease of applic. programming & compiler design.
    • Lower comm. overhead for small items
      • Due to HW support
    • Use automatic caching to reduce comm. needs
  • Advantages of message passing:
    • Hardware is simpler
    • Communication explicit → easier to understand
    • Forces programmer to think about comm. costs
      • Encourages improved design of parallel algorithms
    • Enables more efficient parallel algs. than automatic caching could ever provide
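As a toy contrast (my own illustration, not from the slides), the same reduction can be written against a shared variable protected by a lock, or with explicit messages through a queue; Python threads stand in for processors here:

```python
# Toy contrast (illustrative only): shared-memory vs. message-passing styles.
import threading, queue

def shared_memory_sum(values, n_workers=4):
    total, lock = [0], threading.Lock()
    def worker(chunk):
        s = sum(chunk)
        with lock:                      # implicit communication via shared state
            total[0] += s
    chunks = [values[i::n_workers] for i in range(n_workers)]
    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads: t.start()
    for t in threads: t.join()
    return total[0]

def message_passing_sum(values, n_workers=4):
    q = queue.Queue()
    def worker(chunk):
        q.put(sum(chunk))               # explicit communication via messages
    chunks = [values[i::n_workers] for i in range(n_workers)]
    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads: t.start()
    for t in threads: t.join()
    return sum(q.get() for _ in range(n_workers))

data = list(range(1000))
print(shared_memory_sum(data), message_passing_sum(data))
```
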
Scalability & Maximal Scalability
  • A multiprocessor architecture & accompanying performance model is scalable if:
    • it can be “scaled up” to arbitrarily large problem sizes, and/or arbitrarily large numbers of processors, without the predictions of the performance model breaking down.
  • An architecture (& model) is maximally scalable for a given problem if
    • it is scalable, and if no other scalable architecture can claim asymptotically superior performance on that problem
  • It is universally maximally scalable (UMS) if it is maximally scalable on all problems!
    • I will briefly mention some characteristics of architectures that are universally maximally scalable
Universal Maximum Scalability
  • Existence proof for universally maximally scalable (UMS) architectures:
    • Physics itself can be considered a universal maximally scalable “architecture” because any real computer is just a special case of a physical system.
      • So, obviously, no real class of computers can beat the performance of physical systems in general.
    • Unfortunately, physics doesn’t give us a very simple or convenient programming model.
      • Comprehensive expertise at “programming physics” means mastery of all physical engineering disciplines: chemical, electrical, mechanical, optical, etc.
    • We’d like an easier programming model than this!
Simpler UMS Architectures
  • (I propose) any practical UMS architecture will have the following features:
    • Processing elements characterized by constant parameters (independent of # of processors)
    • Mesh-type message-passing interconnection network, arbitrarily scalable in 2 dimensions
      • w. limited scalability in 3rd dimension.
    • Processing elements that can be operated in an arbitrarily reversible way, at least, up to a point.
      • Enables improved 3-d scalability in a limited regime
    • (In long term) Have capability for quantum-coherent operation, for extra perf. on some probs.
Shared Memory isn’t Scalable
  • Any implementation of shared memory requires communication between nodes.
  • As the # of nodes increases, we get:
    • Extra contention for any shared BW
    • Increased latency (inevitably).
  • Can hide communication delays to a limited extent, by latency hiding:
    • Find other work to do during the latency delay slot.
    • But the amount of “other work” available is limited by node storage capacity, parallelizability of the set of running applications, etc.
Global Unit-Time Message Passing Isn’t Scalable!
  • Naïve model: “Any node can pass a message to any other in a single constant-time interval”
    • independent of the total number of nodes
  • Has same scaling problems as shared memory
  • Even if we assume that BW contention (traffic) isn’t a problem, unit-time assumption is still a problem.
    • Not possible for all N, given speed-of-light limit!
    • Need cube root of N asymptotic time, at minimum.
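A back-of-envelope sketch of the speed-of-light bound (my own illustration; the assumed packing density is arbitrary):

```python
# Minimal sketch (assumptions: nodes packed at a fixed density in 3-D,
# signals limited by the speed of light; numbers are illustrative).
C = 3.0e8            # speed of light, m/s
NODE_VOLUME = 1e-6   # assumed volume per node, m^3 (1 cm^3)

def min_crossing_time(n_nodes):
    side = (n_nodes * NODE_VOLUME) ** (1.0 / 3.0)   # edge of the bounding cube, m
    return side / C                                 # one-way light delay, s

for n in (10**6, 10**9, 10**12):
    # Latency across the machine must grow at least like N^(1/3).
    print(n, min_crossing_time(n))
```
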
Many Interconnect Topologies Aren’t Scalable!
  • Suppose we don’t require that a node can talk to any other node in 1 time unit, but only to selected others.
  • Some such schemes still have scalability problems, e.g.:
    • Hypercubes, fat hypercubes
    • Binary trees, fat-trees
    • Crossbars, butterfly networks
  • Any topology in which the number of unit-time hops to reach any of N nodes is of order less than N^(1/3) is necessarily doomed to failure!

See last year’s exams.

Only Meshes (or subgraphs of meshes) Are Scalable
  • 1-D meshes
    • Linear chain, ring, star (w. fixed # of arms)
  • 2-D meshes
    • Square grid, hex grid, cylinder, 2-sphere, 2-torus,…
  • 3-D meshes
    • Crystal-like lattices, w. various symmetries
    • Amorphous networks w. local interactions in 3d
    • An important caveat:
      • Scalability in 3rd dimension is limited by energy/information I/O considerations! More later…

(Vitányi, 1988)

Which Approach Will Win?
  • Perhaps, the best of all worlds?
  • Here’s one example of a near-future, parallel computing scenario that seems reasonably plausible:
    • SMP architectures within smallest groups of processors on the same chip (chip multiprocessors), sharing a common bus and on-chip DRAM bank.
    • DSM architectures w. flexible topologies to interconnect larger (but still limited-size) groups of processors in a package-level or board-level network.
    • Message-passing w. mesh topologies for communication between different boards in a cluster-in-a-box (blade server), or higher-level conglomeration of machines.

But, what about the heat removal problem?

Landauer’s Principle

Famous IBM researcher’s 1961 paper

  • We know low-level physics is reversible:
    • Means, the time-evolution of a state is bijective
    • Change is deterministic looking backwards in time
      • as well as forwards
  • Physical information (like energy) is conserved
    • It cannot ever be created or destroyed,
      • only reversibly rearranged and transformed!
    • This explains the 2nd Law of Thermodynamics:
      • Entropy (unknown info.) in a closed, unmeasured system can only increase (as we lose track of its state)
  • Irreversible bit “erasure” really just moves the bit into surroundings, increasing entropy & heat
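For scale, the Landauer bound works out to a tiny but nonzero energy per erased bit; a quick calculation at an assumed room temperature of 300 K:

```python
# Worked number (standard physics, illustrative): the minimum heat generated
# by irreversibly erasing one bit at temperature T is k*T*ln(2).
import math
k_B = 1.380649e-23             # Boltzmann constant, J/K
T = 300.0                      # assumed room temperature, K
print(k_B * T * math.log(2))   # ~2.9e-21 J per bit erased
```
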
Illustrating Landauer’s Principle

Landauer’s Principle from basic quantum theory

[Figure: before bit erasure, the system has 2N possible states (N with the stored bit = 0, N with it = 1); unitary (one-to-one) evolution must carry these into 2N distinct states; after erasure the bit is always 0, so the surroundings must take up the extra distinction.]

Increase in entropy: ΔS = log 2 = k ln 2. Energy lost to heat: ΔS·T = kT ln 2.

Scaling in 3rd Dimension?
  • Computing based on ordinary irreversible bit operations only scales in 3d up to a point.
    • All discarded information & associated energy must be removed thru surface. But energy flux is limited!
    • Even a single layer of circuitry in a high-performance CPU can barely be kept cool today!
  • Computing with reversible, “adiabatic” operations does better:
    • Scales in 3d, up to a point…
    • Then with square root of further increases in thickness, up to a point. (Scales in 2.5 dimensions!)
    • Enables much larger thickness than irreversible!
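A rough back-of-envelope sketch of this scaling argument (my own simplification; all power numbers are assumed, and the adiabatic energy-vs-speed tradeoff is idealized):

```python
# Rough sketch (all numbers are illustrative assumptions) of why adiabatic
# operation changes the 3-D scaling law.  Adiabatic logic can trade speed for
# energy: energy per op is roughly proportional to clock speed f, so power per
# layer ~ f**2 and total power ~ layers * f**2.
POWER_BUDGET = 100.0        # W/cm^2 removable through the surface (assumed)
P_LAYER_FULL_SPEED = 50.0   # W/cm^2 per layer at full speed f = 1 (assumed)

def throughput(layers):
    """Relative useful ops/sec per cm^2 at the fastest speed the budget allows."""
    f = min(1.0, (POWER_BUDGET / (layers * P_LAYER_FULL_SPEED)) ** 0.5)
    return layers * f

for layers in (1, 2, 8, 32, 128):
    # Beyond ~2 layers, throughput grows only like sqrt(layers).
    print(layers, round(throughput(layers), 2))
```
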
Reversible 3-D Mesh

Note the differing power laws!

Cost-Efficiency of Reversibility

Scenario: a $1,000, 100-Watt conventional computer with a 3-year lifetime, vs. reversible computers of the same storage capacity.

[Plot: bit-operations per US dollar over time for best-case reversible computing (~100,000×), worst-case reversible computing (~1,000×), and conventional irreversible computing. All curves would → 0 if leakage is not reduced.]

Example Parallel Applications

“Embarrassingly parallel”

  • Computation-intensive applications:
    • Factoring large numbers, cracking codes
    • Combinatorial search & optimization problems:
      • Find a proof of a theorem, or a solution to a puzzle
      • Find an optimal engineering design or data model, over a large space of possible design parameter settings
    • Solving a game-theory or decision-theory problem
    • Rendering an animated movie
  • Communication-intensive applications:
    • Physical simulations (sec. 6.2 has some examples)
      • Also multiplayer games, virtual work environments
    • File serving, transaction processing in distributed database systems
H&P Chapter 6 - Multiprocessing
  • Introduction
  • Application Domains
  • Symmetric Shared Memory Architectures
    • Their performance
  • Distributed Shared Memory Architectures
    • Their performance
  • Synchronization
  • Memory consistency
  • Multithreading
  • Crosscutting Issues
  • Example: Sun Wildfire
  • Multithreading example
  • Embedded multiprocs.
  • Fallacies & Pitfalls
  • Concluding remarks
  • Historical perspective
More about SMPs (6.3)
  • Caches help reduce each processor’s mem. bandwidth
    • Means many processors can share total memory BW
  • Microprocessor-based Symmetric MultiProcessors (SMPs) emerged in the 80’s
    • Very cost effective, up to limit of memory BW
  • Early SMPs had 1 CPU per board (off the backplane)
    • Now multiple per board, per MCM, or even per die
  • Memory system caches both shared and private (local) data
    • Private data in 1 cache only
    • Shared data may be replicated
Cache Coherence Problem
  • Goal: All processors should have a consistent view of the shared memory contents, and how they change.
    • Or, as nearly consistent as we can manage.
  • The fundamental difficulty:
    • Written information takes time to propagate!
      • E.g., A writes, then B writes, then A reads (like a WAW hazard)
        • A might see the value from A, instead of the value from B
  • A simple, but inefficient solution:
    • Have all writes cause all processors to stall (or at least, not perform any new accesses) until all have received the result of the write.
      • Reads, on the other hand, can be reordered amongst themselves.
    • But: Incurs a worst-case memory stall on each write step!
      • Can alleviate this by allowing writes to occur only periodically
        • But this reduces bandwidth for writes
        • And increases avg. latency for communication through shared memory
Another Interesting Method

Research by Chris Carothers at RPI

  • Maintain a consistent system “virtual time” modeled by all processors.
    • Each processor asynchronously tracks its local idea of the current virtual time. (Local Virtual Time)
  • On a write, asynchronously send invalidate messages timestamped with the writer’s LVT.
  • On receiving an invalidate message stamped earlier than the reader’s LVT,
    • Roll back the local state to that earlier time
      • There are efficient techniques for doing this
  • If timestamped later than the reader’s LVT,
    • Queue it up until the reader’s LVT reaches that time

(This is an example of speculation.)

Frank-Lewis Rollback Method

Steve Lewis’ MS thesis, UF, 2001

(Reversible MIPS Emulator & Debugger)

  • Fixed-size window
    • Limits how far back you can go.
  • Periodically store checkpoints of machine state
    • Each checkpoint records changes needed
      • to get back to that earlier state from next checkpoint,
      • or from current state if it’s the last checkpoint
    • Cull out older checkpoints periodically
      • so the total number stays logarithmic in the size of the window.
  • Also, store messages received during time window
  • To go backwards Δt steps (to time t_old = t_cur − Δt),
    • Revert the machine state to the latest checkpoint preceding time t_old
      • Apply changes recorded in checkpoints from t_cur on backwards
    • Compute forwards from there to time t_old
  • Technique is fairly time- and space- efficient
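A rough sketch of one way such logarithmic culling might be organized (my own simplification for illustration, not the actual Frank-Lewis implementation):

```python
# Rough sketch (my own simplification, not the Frank-Lewis code): keep a number
# of checkpoints logarithmic in the window size by retaining at most one
# checkpoint per power-of-two distance bucket behind the present.
def cull(checkpoint_times, now):
    """Return the checkpoint times kept after culling."""
    kept, seen_buckets = [], set()
    for t in sorted(checkpoint_times, reverse=True):   # newest first
        bucket = (now - t).bit_length()                # bucket k covers [2^(k-1), 2^k)
        if bucket not in seen_buckets:
            seen_buckets.add(bucket)
            kept.append(t)
    return sorted(kept)

times = list(range(0, 101, 5))      # checkpoints taken every 5 steps
print(cull(times, now=100))         # only O(log window) of them survive
```
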
Definition of Coherence
  • A weaker condition than full consistency.
  • A memory system is called coherent if:
    • Reads return the most recent value written locally,
      • if no other processor wrote the location in the meantime.
    • A read can return the value written by another processor,if the times are far enough apart.
      • And, if nobody else wrote the location in between
    • Writes to any given location are serialized.
      • If A writes a location, then B writes the location, all processors first see the value written by A, then (later) the value written by B.
      • Avoids WAW hazards leaving cache in wrong state.
Cache Coherence Protocols
  • Two common types: (Differ in how they track blocks’ sharing state)
    • Directory-based:
      • sharing status of a block is kept in a centralized directory
    • Snooping (or “snoopy”):
      • Sharing status of each block is maintained (redundantly) locally by each cache
      • All caches monitor or snoop (eavesdrop) on the memory bus,
        • to notice events relevant to sharing status of blocks they have
  • Snooping tends to be more popular
Write Invalidate Protocols
  • When a processor wants to write to a block,
    • It first “grabs ownership” of that block,
    • By telling all other processors to invalidate their own local copy.
  • This ensures coherence, because
    • A block recently written is cached in 1 place only:
      • The cache of the processor that most recently wrote it
    • Anyone else who wants to write that block will first have to grab back the most recent copy.
      • The block is also written to memory at that time.

Analogous to using RCS to lock files

Meaning of Bus Messages
  • Write miss on block B:
    • “Hey, I want to write block B. Everyone, give me the most recent copy if you’re the one who has it. And everyone, also throw away your own copy.”
  • Read miss on block B:
    • “Hey, I want to read block B. Everyone, give me the most recent copy, if you have it. But you don’t have to throw away your own copy.”
  • Writeback of block B:
    • “Here is the most recent copy of block B, which I produced. I promise not to make any more changes until after I ask for ownership back and receive it.”
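To make the protocol concrete, here is a toy single-bus, write-invalidate simulation in the spirit of these messages; the three-state (Invalid / Shared / Modified) organization is the standard textbook MSI scheme, heavily simplified and not cycle-accurate:

```python
# Toy write-invalidate snooping cache (a simplified, MSI-style sketch of the
# bus messages above; not a complete or cycle-accurate protocol).
INVALID, SHARED, MODIFIED = "I", "S", "M"

class Cache:
    def __init__(self, name, bus):
        self.name, self.bus, self.lines = name, bus, {}   # block -> (state, value)

    # --- processor side ----------------------------------------------------
    def read(self, block):
        state, value = self.lines.get(block, (INVALID, None))
        if state == INVALID:
            value = self.bus.read_miss(self, block)       # "give me the latest copy"
            self.lines[block] = (SHARED, value)
        return self.lines[block][1]

    def write(self, block, value):
        state, _ = self.lines.get(block, (INVALID, None))
        if state != MODIFIED:
            self.bus.write_miss(self, block)              # grab ownership, invalidate others
        self.lines[block] = (MODIFIED, value)

    # --- snooping side -----------------------------------------------------
    def snoop_read_miss(self, block):
        state, value = self.lines.get(block, (INVALID, None))
        if state == MODIFIED:
            self.bus.memory[block] = value                # writeback
            self.lines[block] = (SHARED, value)           # keep a shared copy

    def snoop_write_miss(self, block):
        state, value = self.lines.get(block, (INVALID, None))
        if state == MODIFIED:
            self.bus.memory[block] = value                # writeback before losing the block
        if state != INVALID:
            self.lines[block] = (INVALID, None)           # throw away our copy

class Bus:
    def __init__(self):
        self.caches, self.memory = [], {}

    def attach(self, cache):
        self.caches.append(cache)

    def read_miss(self, requester, block):
        for c in self.caches:
            if c is not requester:
                c.snoop_read_miss(block)
        return self.memory.get(block, 0)

    def write_miss(self, requester, block):
        for c in self.caches:
            if c is not requester:
                c.snoop_write_miss(block)

bus = Bus()
a, b = Cache("A", bus), Cache("B", bus)
bus.attach(a); bus.attach(b)

a.write("X", 1)          # A owns X (Modified)
print(b.read("X"))       # B's read miss forces A to write back; prints 1
b.write("X", 2)          # B grabs ownership; A's copy is invalidated
print(a.read("X"))       # A misses and sees the serialized latest value: 2
```
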
Write-Update Coherence Protocol
  • Also called write broadcast.
  • Strategy: Update all cached copies of a block when the block is written.
  • Comparison versus write-invalidate:
    • More bus traffic for multiple writes by 1 processor
    • Less latency for data to be passed between proc’s.
  • Bus & memory bandwidth is a key limiting factor!
    • Write-invalidate usually gives best overall perf.