Multiprocessors— Large vs. Small Scale

PowerPoint Slideshow about 'Multiprocessors— Large vs. Small Scale' - toyah


Small-Scale MIMD Designs
  • Memory: centralized with uniform memory access time (UMA) and bus interconnect
  • Examples: SPARCcenter
Large-Scale MIMD Designs
  • Memory: distributed with non-uniform memory access time (NUMA) and scalable interconnect
  • Examples: Cray T3D, Intel Paragon, CM-5
Communication Models
  • Shared Memory
    • Communication via shared address space
    • Advantages:
      • Ease of programming
      • Lower latency
      • Easier to use hardware-controlled caching
  • Message passing
    • Processors have private memories, communicate via messages
    • Advantages:
      • Less hardware, easier to design
      • Focuses attention on costly non-local operations
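
The two models can be contrasted with a minimal Python sketch. It uses threads inside one process as a stand-in for processors, so it illustrates the programming style only, not real hardware. The function names `shared_memory_sum` and `message_passing_sum` are hypothetical and chosen for this illustration.

```python
import queue
import threading


def shared_memory_sum(chunks):
    """Workers communicate by updating one variable in a shared address space."""
    total = {"value": 0}
    lock = threading.Lock()  # synchronization is needed around the shared data

    def worker(chunk):
        partial = sum(chunk)
        with lock:
            total["value"] += partial

    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total["value"]


def message_passing_sum(chunks):
    """Workers keep their data private and send partial sums as explicit messages."""
    mailbox = queue.Queue()

    def worker(chunk):
        mailbox.put(sum(chunk))  # the only communication is this message

    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(mailbox.get() for _ in chunks)


if __name__ == "__main__":
    data = [[1, 2, 3], [4, 5], [6]]
    print(shared_memory_sum(data), message_passing_sum(data))
```

In the message-passing version every non-local operation is visible as a `put` or `get`, which is the sense in which that model focuses attention on costly communication.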
Communication Properties
  • Bandwidth
    • Need high bandwidth in communication
    • Limits in network, memory, and processor
  • Latency
    • Affects performance, since the processor waits
    • Affects ease of programming: how can communication and computation be overlapped?
  • Latency Hiding
    • How can a mechanism help hide latency?
    • Examples: overlap message send with computation, prefetch
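
As a rough sketch of latency hiding by prefetching, the Python code below requests the next block while the current one is being processed. `fetch_remote` is a hypothetical stand-in for a slow remote access (simulated with a short sleep), and `compute` is an arbitrary placeholder workload. Neither comes from the slides.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_remote(block_id, delay=0.01):
    """Hypothetical remote read: the sleep stands in for communication latency."""
    time.sleep(delay)
    return [block_id] * 4


def compute(data):
    """Placeholder local computation on one block."""
    return sum(x * x for x in data)


def process_with_prefetch(block_ids):
    """Overlap the fetch of block i+1 with the computation on block i."""
    if not block_ids:
        return []
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(fetch_remote, block_ids[0])
        for next_id in block_ids[1:]:
            data = pending.result()                       # wait for the current block
            pending = pool.submit(fetch_remote, next_id)  # prefetch the next block
            results.append(compute(data))                 # compute while the fetch is in flight
        results.append(compute(pending.result()))
    return results


if __name__ == "__main__":
    print(process_with_prefetch([1, 2, 3]))
```

The results match a purely sequential fetch-then-compute loop. Only the waiting time changes, because each fetch after the first runs concurrently with useful work.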
Small-Scale—Shared Memory
  • Caches serve to:
    • Increase bandwidth versus bus/memory
    • Reduce latency of access
    • Valuable for both private data and shared data
  • What about cache consistency?
The Problem of Cache Coherency
  • Value of X in memory is 1
  • CPU A reads X – its cache now contains 1
  • CPU B reads X – its cache now contains 1
  • CPU A stores 0 into X
    • CPU A’s cache contains a 0
    • CPU B’s cache contains a 1
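
The sequence above can be replayed with a toy model in Python. The `Cache` class is hypothetical and deliberately has no coherence protocol. It assumes a write-through policy (the store also updates memory), which the slide does not specify.

```python
class Cache:
    """Toy private cache with NO coherence protocol (illustration only)."""

    def __init__(self, memory):
        self.memory = memory  # shared main memory, modeled as a dict
        self.lines = {}       # this CPU's private cached copies

    def read(self, addr):
        if addr not in self.lines:          # miss: load the value from memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]             # hit: return the (possibly stale) copy

    def write(self, addr, value):
        self.lines[addr] = value
        self.memory[addr] = value           # assumed write-through; other caches are not notified


def replay():
    memory = {"X": 1}
    cpu_a, cpu_b = Cache(memory), Cache(memory)
    cpu_a.read("X")      # CPU A's cache now contains 1
    cpu_b.read("X")      # CPU B's cache now contains 1
    cpu_a.write("X", 0)  # CPU A stores 0 into X
    return cpu_a.read("X"), cpu_b.read("X"), memory["X"]


if __name__ == "__main__":
    print(replay())
```

After the store, CPU B still reads the old value from its own cache even though memory has been updated, which is exactly the inconsistency a coherence mechanism has to prevent.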
Multicore Computers (chip multiprocessors)

Combine two or more processors (cores) on a single piece of silicon

Each core consists of ALU, registers, pipeline hardware, L1 instruction and data caches

Multithreading is used

Pollack’s Rule
  • Performance increase is roughly proportional to the square root of the increase in complexity

performance ∝ √complexity

  • Power consumption increase is roughly linearly proportional to the increase in complexity

power consumption ∝ complexity

Pollack’s Rule

complexity   power   performance
1            1       1
4            4       2
25           25      5

100s of low complexity cores, each operating at very low power

Ex: Four small cores

complexity   power   performance
4 x 1        4 x 1   4
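
The numbers in the table follow directly from the two proportionalities. The Python sketch below reproduces them. The function name `pollack` is hypothetical, and the multi-core case assumes performance scales perfectly with the number of cores, as the four-small-cores example does.

```python
import math


def pollack(complexity, cores=1):
    """Return (power, performance) relative to a single core of complexity 1.

    Assumes power grows linearly and performance with the square root of
    per-core complexity, and that multiple cores scale perfectly.
    """
    power = cores * complexity
    performance = cores * math.sqrt(complexity)
    return power, performance


if __name__ == "__main__":
    for c in (1, 4, 25):
        print("one core, complexity", c, "->", pollack(c))
    print("four cores, complexity 1 ->", pollack(1, cores=4))
```

For the same power budget of 4, one complex core delivers a performance of 2, while four simple cores deliver 4, provided the workload can actually be spread over all four.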

Increasing CPU Performance

Manycore Chip

  • Composed of hybrid cores
    • Some general purpose
    • Some graphics
    • Some floating point
Exascale Systems
  • Board composed of multiple manycore chips sharing memory
  • Rack composed of multiple boards
  • A room full of these racks

Millions of cores

Exascale systems (10^18 Flop/s)
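
To get a feel for the scale, the hierarchy of racks, boards, chips and cores can be multiplied out. In the Python sketch below, every count in the example run is a made-up illustration value, not a figure from the slides. Only the 10^18 Flop/s target is taken from the text.

```python
EXAFLOP = 10**18  # floating-point operations per second


def total_cores(racks, boards_per_rack, chips_per_board, cores_per_chip):
    """Number of cores in a room of racks built from boards of manycore chips."""
    return racks * boards_per_rack * chips_per_board * cores_per_chip


def flops_per_core(cores, target=EXAFLOP):
    """Sustained rate each core must deliver to reach the target in aggregate."""
    return target // cores


if __name__ == "__main__":
    # Hypothetical configuration, for illustration only
    cores = total_cores(racks=200, boards_per_rack=50,
                        chips_per_board=10, cores_per_chip=100)
    print(cores, "cores, each needing", flops_per_core(cores), "Flop/s")
```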

Moore’s Law Reinterpreted
  • Number of cores per chip doubles every 2 years
  • Number of threads of execution doubles every 2 years
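
Doubling every 2 years is exponential growth, which a short Python sketch makes concrete. The function name `cores_after` and the starting value in the example are assumptions for illustration, and only whole doubling periods are counted.

```python
def cores_after(initial_cores, years, doubling_period=2):
    """Cores per chip after `years`, doubling once per completed period."""
    return initial_cores * 2 ** (years // doubling_period)


if __name__ == "__main__":
    # Hypothetical starting point: a 4-core chip
    for years in (0, 2, 6, 10, 20):
        print(years, "years:", cores_after(4, years), "cores")
```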
Shared Memory MIMD

Shared memory

Single address space

All processes have access to the pool of shared memory

[Figure: four processors (P) connected by a bus to a shared memory]

Shared Memory MIMD

Each processor executes different instructions asynchronously, using different data

[Figure: four CU/PE pairs attached to a shared Memory; each pair has its own instruction and data streams]

Symmetric Multiprocessors (SMP)

MIMD

Shared memory

UMA

[Figure: two processors (Proc), each with its own L1 and L2 cache, attached to a system bus shared with main memory and I/O modules]

Symmetric Multiprocessors (SMP)

Characteristics:

Two or more similar processors

Processors share the same memory and I/O facilities

Processors are connected by bus or other internal connection scheme, such that memory access time is the same for each processor

All processors share access to I/O devices

All processors can perform the same functions

The system is controlled by the operating system

Symmetric Multiprocessors (SMP)

Operating system:

Provides tools and functions to exploit the parallelism

Schedules processes or threads across all of the processors

Takes care of

scheduling of threads and processes on processors

synchronization among processors

Multicore Computers

Dedicated L1 Cache

(ARM11 MPCore)

[Figure: CPU cores 1 to n, each with dedicated L1-I and L1-D caches; a single L2 cache, main memory and I/O]

Multicore Computers

Dedicated L2 Cache

(AMD Opteron)

[Figure: CPU cores 1 to n, each with dedicated L1-I, L1-D and L2 caches; main memory and I/O]

Multicore Computers

Shared L2 Cache

(Intel Core Duo)

[Figure: CPU cores 1 to n, each with dedicated L1-I and L1-D caches; one L2 cache shared by all cores; main memory and I/O]

Multicore Computers

Shared L3 Cache

(Intel Core i7)

[Figure: CPU cores 1 to n, each with dedicated L1-I, L1-D and L2 caches; one L3 cache shared by all cores; main memory and I/O]

Multicore Computers

Advantages of Shared L2 cache

Reduced overall miss rate

A thread on one core may cause a frame to be brought into the cache; a thread on another core may then access the same location, which is already in the cache

Data shared by multiple cores is not replicated

The amount of shared cache allocated to each core may be dynamic

Interprocessor communication is easy to implement

Advantages of Dedicated L2 cache

Each core can access its private cache more rapidly

L3 cache

When the amount of memory and number of cores grow, L3 cache provides better performance
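
The miss-rate and replication arguments for a shared L2 can be illustrated with a very small Python model. It is a sketch under strong assumptions: caches are unbounded, so only first-touch (cold) misses are counted, and the access trace is invented for the example. The function name `count_misses` is hypothetical.

```python
def count_misses(trace, shared):
    """Count L2 misses for a trace of (core, address) accesses.

    shared=True  -> all cores use one L2 cache
    shared=False -> each core has a dedicated L2 cache
    Caches are unbounded, so only cold misses occur.
    """
    caches = {}
    misses = 0
    for core, addr in trace:
        key = "shared" if shared else core
        lines = caches.setdefault(key, set())
        if addr not in lines:
            misses += 1
            lines.add(addr)  # with dedicated caches, shared data ends up replicated
    return misses


if __name__ == "__main__":
    # Two cores touching the same two locations (illustrative trace)
    trace = [(0, "A"), (1, "A"), (0, "B"), (1, "B")]
    print("shared L2 misses:   ", count_misses(trace, shared=True))
    print("dedicated L2 misses:", count_misses(trace, shared=False))
```

With the shared cache the second core hits on data the first core already brought in. With dedicated caches each core misses separately and keeps its own copy. The model ignores the access-time advantage of a private cache, which is the point made for the dedicated design.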

Multicore Computers

On-chip interconnects

Bus

Crossbar

Off-chip communication (CPU-to-CPU or I/O):

Bus-based


Multicore Computers

Multithreading

A multithreaded processor provides a separate PC for each thread (hardware multithreading)

Implicit multithreading

Concurrent execution of multiple threads extracted from a single sequential program

Explicit multithreading

Execute instructions from different explicit threads by interleaving instructions from different threads on shared or parallel pipelines

Multicore Computers

Explicit Multithreading

Fine-grained multithreading (Interleaved multithreading)

Processor deals with two or more thread contexts at a time

Switching from one thread to another at each clock cycle

Coarse-grained multithreading (Blocked multithreading)

Instructions of a thread are executed sequentially until an event that causes a delay (e.g., a cache miss) occurs

This event causes a switch to another thread

Simultaneous multithreading (SMT)

Instructions are simultaneously issued from multiple threads to the execution units of a superscalar processor

Thread-level parallelism is combined with instruction-level parallelism (ILP)

Chip multiprocessing (CMP)

Each processor of a multicore system handles separate threads
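
The difference between interleaved and blocked multithreading shows up in the order in which instructions are issued. The Python sketch below is a simplified single-issue model. The function names and the convention of marking a delay-causing instruction with a trailing `*` are assumptions made for this illustration.

```python
from collections import deque


def fine_grained(threads):
    """Interleaved multithreading: switch to the next thread every cycle."""
    ready = deque(deque(t) for t in threads if t)
    order = []
    while ready:
        current = ready.popleft()
        order.append(current.popleft())  # issue exactly one instruction
        if current:
            ready.append(current)        # then move on to the next thread
    return order


def coarse_grained(threads, stall_marker="*"):
    """Blocked multithreading: run one thread until a delay-causing event."""
    ready = deque(deque(t) for t in threads if t)
    order = []
    while ready:
        current = ready.popleft()
        while current:
            instr = current.popleft()
            order.append(instr)
            if instr.endswith(stall_marker):  # e.g. a cache miss: switch threads
                break
        if current:
            ready.append(current)
    return order


if __name__ == "__main__":
    threads = [["a1", "a2*", "a3"], ["b1", "b2", "b3"]]
    print(fine_grained(threads))
    print(coarse_grained(threads))
```

For this input the interleaved scheme alternates between threads a and b on every instruction, while the blocked scheme stays with thread a until `a2*` stalls, runs thread b to completion, and only then returns to `a3`. SMT and CMP are not modeled here, since both issue from several threads in the same cycle.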

GPUs (Graphics Processing Units)

Characteristics of GPUs

GPUs are accelerators for CPUs

SIMD

GPUs have many parallel processors and many concurrent threads (e.g., 10 or more cores; 100s or 1000s of threads per core)

CPU-GPU combination is an example of heterogeneous computing

GPGPU (general purpose GPU): using a GPU to run applications traditionally handled by the CPU