
PARALLEL PROCESSOR ORGANIZATIONS

Jehan-François Pâris

jfparis@uh.edu

Chapter Organization
  • Overview
  • Writing parallel programs
  • Multiprocessor Organizations
  • Hardware multithreading
  • Alphabet soup (SISD, SIMD, MIMD, …)
  • Roofline performance model
The hardware side
  • Many parallel processing solutions
    • Multiprocessor architectures
      • Two or more microprocessor chips
      • Multiple architectures
    • Multicore architectures
      • Several processors on a single chip
The software side
  • Two ways for software to exploit parallel processing capabilities of hardware
    • Job-level parallelism
      • Several sequential processes run in parallel
      • Easy to implement (OS does the job!)
    • Process-level parallelism
      • A single program runs on several processors at the same time
Overview
  • Some problems are embarrassingly parallel
    • Many computer graphics tasks
    • Brute force searches in cryptography or password guessing
  • Much more difficult for other applications
    • Communication overhead among sub-tasks
    • Amdahl's law
    • Balancing the load
Amdahl's Law
  • Assume a sequential process takes
    • tp seconds to perform operations that could be performed in parallel
    • ts seconds to perform purely sequential operations
  • The maximum speedup will be

(tp + ts) / ts

Balancing the load
  • Must ensure that workload is equally divided among all the processors
  • Worst case is when one of the processors does much more work than all others
Example (I)
  • Computation partitioned among n processors
  • One of them does 1/m of the work, with m < n
    • That processor becomes a bottleneck
  • Maximum expected speedup: n
  • Actual maximum speedup: m
Example (II)
  • Computation partitioned among 64 processors
  • One of them does 1/8 of the work
  • Maximum expected speedup: 64
  • Actual maximum speedup: 8
A last issue
  • Humans like to address issues one after the other
    • We have meeting agendas
    • We do not like to be interrupted
    • We write sequential programs
Rene Descartes
  • Seventeenth-century French philosopher
  • Invented
    • Cartesian coordinates
    • Methodical doubt
      • [To] never accept anything for true which I did not clearly know to be such
  • Proposed a scientific method based on four precepts
Method's third rule
  • The third, to conduct my thoughts in such order that, by commencing with objects the simplest and easiest to know, I might ascend by little and little, and, as it were, step by step, to the knowledge of the more complex; assigning in thought a certain order even to those objects which in their own nature do not stand in a relation of antecedence and sequence.
Shared memory multiprocessors

[Diagram: several processing units (PU), each with a private cache, connected through an interconnection network to shared RAM and I/O.]

Shared memory multiprocessor
  • Can offer
    • Uniform memory access (UMA) to all processors
      • Easiest to program
    • Non-uniform memory access (NUMA)
      • Can scale up to larger sizes
      • Offers faster access to nearby memory
computer clusters

PU

PU

PU

Cache

Cache

Cache

RAM

RAM

RAM

Computer clusters

Interconnection network

computer clusters1
Computer clusters
  • Very easy to assemble
  • Can take advantage of high-speed LANs
    • Gigabit Ethernet, Myrinet, …
  • Data exchanges must be done through message passing
Message passing (I)
  • If processor P wants to access data in the main memory of processor Q, it must
    • Send a request to Q
    • Wait for a reply
  • For this to work, processor Q must have a thread
    • Waiting for messages from other processors
    • Sending them replies
Message passing (II)
  • In a shared memory architecture, each processor can directly access all data
  • A proposed solution
    • Distributed shared memory offers to the users of a cluster the illusion of a single address space for their shared data
    • Still has performance issues
When things do not add up
  • Memory capacity is very important for big computing applications
    • If the data can fit into main memory, the computation will run much faster
A problem
  • A company replaced
    • A single shared-memory computer with 32 GB of RAM with
    • Four “clustered” computers with 8 GB of RAM each
  • The result was more I/O activity than ever
  • What happened?
The explanation
  • Assume the OS occupies one GB of RAM
    • The old shared-memory computer still had 31 GB of free RAM
    • Each of the clustered computers has only 7 GB of free RAM
  • The total RAM available to the program went down from 31 GB to 4 × 7 = 28 GB!
Grid computing
  • The computers are distributed over a very large network
    • Sometimes computer time is donated
      • Volunteer computing
      • Seti@Home
    • Works well with embarrassingly parallel workloads
      • Searches in an n-dimensional space
General idea
  • Let the processor switch to another thread of computation while the current one is stalled
  • Motivation:
    • Increased cost of cache misses
Implementation
  • Entirely controlled by the hardware
    • Unlike multiprogramming
  • Requires a processor capable of
    • Keeping track of the state of each thread
      • One set of registers (including the PC) for each concurrent thread
    • Quickly switching among concurrent threads
Approaches
  • Fine-grained multithreading:
    • Switches between threads for each instruction
    • Provides the highest throughput
    • Slows down execution of individual threads
Approaches
  • Coarse-grained multithreading
    • Switches between threads whenever a long stall is detected
    • Easier to implement
    • Cannot eliminate all stalls
Approaches
  • Simultaneous multi-threading:
    • Exploits the ability of modern multiple-issue hardware to execute, in parallel, instructions from different threads
    • Best solution
Overview
  • Used to describe processor organizations where
    • Same instructions can be applied to
    • Multiple data instances
  • Encountered in
    • Vector processors in the past
    • Graphic processing units (GPU)
    • x86 multimedia extensions
Classification
  • SISD:
    • Single instruction, single data
    • Conventional uniprocessor architecture
  • MIMD:
    • Multiple instructions, multiple data
    • Conventional multiprocessor architecture
Classification
  • SIMD:
    • Single instruction, multiple data
    • Perform same operations on a set of similar data
      • Think of adding two vectors

for (i = 0; i < VECSIZE; i++) sum[i] = a[i] + b[i];

Vector computing
  • A kind of SIMD architecture
    • Used by Cray computers
  • Pipelines the execution of a single instruction applied to multiple data items (“vectors”) through the ALU
  • Requires
    • Vector registers able to store multiple values
    • Special vector instructions: say lv, addv, …
Benchmarking
  • Two factors to consider
    • Memory bandwidth
      • Depends on interconnection network
    • Floating-point performance
  • Best known benchmark is LINPACK
Roofline model
  • Takes into account
    • Memory bandwidth
    • Floating-point performance
  • Introduces arithmetic intensity
    • Total number of floating point operations in a program divided by total number of bytes transferred to main memory
    • Measured in FLOPs/byte
Roofline model
  • Attainable GFLOPs/s = min(Peak memory bandwidth × Arithmetic intensity, Peak floating-point performance)
Roofline model

[Graph: attainable performance versus arithmetic intensity; floating-point performance rises along the memory-bandwidth slope, then flattens at the peak floating-point performance.]