Scalable Multiprocessors

Read Dubois/Annavaram/Stenström Chapter 5.5-5.6

(COMA architectures could be paper topic)

Read Dubois/Annavaram/Stenström Chapter 6

  • What is a scalable design? (7.1)
  • Realizing programming models (7.2)
  • Scalable communication architectures (SCAs)
    • Message-based SCAs (7.3-7.5)
    • Shared-memory based SCAs (7.6)

PCOD: Scalable Parallelism (ICs)

Per Stenström (c) 2008, Sally A. McKee (c) 2011

Scalability

Goals (P is number of processors)

  • Bandwidth: scale linearly with P
  • Latency: short and independent of P
  • Cost: low fixed cost and scale linearly with P

Example: A bus-based multiprocessor

  • Bandwidth: constant
  • Latency: short and constant
  • Cost: high for infrastructure and then linear

Organizational Issues

[Figures: dance-hall memory organization; distributed memory organization]

  • Network composed of switches for performance and cost
  • Many concurrent transactions allowed
  • Distributed memory can bring down bandwidth demands

Bandwidth scaling:

    • no global arbitration and ordering
    • broadcast bandwidth fixed and expensive

Scaling Issues

Latency scaling:

  • T(n) = Overhead + Channel Time + Routing Delay
  • Channel Time is a function of bandwidth
  • Routing Delay is a function of number of hops in network
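
As a rough illustration of the latency model above, the sketch below evaluates T(n) for an n-byte message. It assumes Channel Time = n / bandwidth and Routing Delay = hops × per-hop delay; the function name, parameter names, and the numbers in the usage note are hypothetical and do not come from the slides.

```python
def transfer_time(n_bytes, overhead_s, bandwidth_bytes_per_s, hops, per_hop_delay_s):
    """T(n) = Overhead + Channel Time + Routing Delay, all in seconds."""
    channel_time = n_bytes / bandwidth_bytes_per_s   # assumption: n / B
    routing_delay = hops * per_hop_delay_s           # assumption: linear in hop count
    return overhead_s + channel_time + routing_delay
```

For example, `transfer_time(1000, 1e-6, 1e9, 4, 50e-9)` adds 1 µs of overhead, 1 µs of channel time, and 0.2 µs of routing delay. Only the last term depends on the network size, through the hop count.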

Cost scaling:

  • Cost(p,m) = Fixed cost + Incremental Cost (p,m)
  • Design is cost-effective if speedup(p,m) > costup(p,m)
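
The cost-effectiveness test can be sketched as follows. This is a simplified model under stated assumptions: costup(p) is taken as cost(p) / cost(1), the incremental cost is assumed linear in p, and the memory parameter m is dropped. All names and numbers are hypothetical.

```python
def cost(p, fixed_cost, cost_per_node):
    """Cost(p) = fixed cost + incremental cost (assumed linear in p)."""
    return fixed_cost + p * cost_per_node

def costup(p, fixed_cost, cost_per_node):
    """Cost of a p-processor system relative to a 1-processor system."""
    return cost(p, fixed_cost, cost_per_node) / cost(1, fixed_cost, cost_per_node)

def is_cost_effective(speedup, p, fixed_cost, cost_per_node):
    """A design is cost-effective if speedup exceeds costup."""
    return speedup > costup(p, fixed_cost, cost_per_node)
```

With a large fixed cost, costup grows much more slowly than p, so a design can be cost-effective even when its speedup is well below linear.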

Physical Scaling
  • Chip-, board-, and system-level partitioning has a big impact on scaling
  • However, there is little consensus on the right partitioning

Network Transaction Primitives

Primitives to implement the programming model on a scalable machine

  • One-way transfer between source and destination
  • Resembles a bus transaction but much richer in variety

Examples:

  • A message send transaction
  • A write transaction in a SAS machine

Bus vs. Network Transactions

Design Issue                 Bus Transactions           Network Transactions
Protection                   V->P address translation   Done at multiple points
Format                       Fixed                      Flexible
Output buffering             Simple                     Supports flexible formats
Media arbitration            Global                     Distributed
Destination name & routing   Direct                     Via several switches
Input buffering              One source                 Several sources
Action                       Response                   Rich diversity
Completion detection         Simple                     Response transaction
Transaction ordering         Global order               No global order

SAS Transactions

Issues:

  • Fixed or variable size transfers
  • Deadlock avoidance and handling of full input buffers

Sequential Consistency

Issues:

  • Writes need acks to signal completion
  • SC may cause extreme waiting times
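
A back-of-the-envelope model of why waiting for write acks can be costly. It assumes that under SC each write must be acknowledged before the next memory operation issues, and compares that with a hypothetical scheme that overlaps writes and waits only for the last ack. The function names and time units are illustrative assumptions, not part of the slides.

```python
def sc_write_time(num_writes, issue_time, round_trip_time):
    """Each write waits for its ack before the next one issues."""
    return num_writes * (issue_time + round_trip_time)

def overlapped_write_time(num_writes, issue_time, round_trip_time):
    """Writes issue back to back; only the last ack is waited for."""
    return num_writes * issue_time + round_trip_time
```

With 100 writes, a 10 ns issue time, and a 1000 ns round trip, the first model gives 101000 ns and the second 2000 ns. The gap grows with the network round-trip time.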

Message Passing

Multiple flavors of synchronization semantics

  • Blocking versus non-blocking
    • Blocking send/recv returns when operation completes
    • Non-blocking returns immediately (probe function tests completion)
  • Synchronous
    • Send completes after matching receive has executed
    • Receive completes after data transfer from matching send completes
  • Asynchronous (buffered, in MPI terminology)
    • Send completes as soon as send buffer may be reused
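
The difference between these semantics can be sketched with a toy in-process channel. This is a minimal illustration using Python threads, not an MPI implementation; the `Channel` class and its method names are hypothetical.

```python
import queue
import threading

class Channel:
    """Toy channel between two threads (hypothetical, for illustration only)."""

    def __init__(self):
        self._q = queue.Queue()

    def send_buffered(self, data):
        # Asynchronous (buffered) send: completes once the data has been
        # copied, so the caller may reuse its send buffer immediately.
        self._q.put((bytes(data), None))

    def send_sync(self, data):
        # Synchronous send: completes only after the matching receive ran.
        matched = threading.Event()
        self._q.put((bytes(data), matched))
        matched.wait()

    def recv(self):
        # Blocking receive: returns when a message has been transferred.
        data, matched = self._q.get()
        if matched is not None:
            matched.set()   # release the synchronous sender
        return data

    def probe(self):
        # Non-blocking test: is a message waiting?
        return not self._q.empty()
```

A caller of `send_buffered` can overwrite its buffer right after the call returns, whereas a caller of `send_sync` stays blocked until some thread executes the matching `recv`.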

Synchronous MP Protocol

Alternative: Keep match table at the sender, enabling a two-phase receive-initiated protocol

Asynchronous Optimistic MP Protocol

Issues:

  • Copying overhead at receiver from temp buffer to user space
  • Huge buffer space at receiver to cope with worst case
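
Both issues show up in a small sketch of the receiver side. It assumes the sender transmits data without waiting for a matching receive, so the receiver must hold early arrivals in temporary buffers. The class, its fields, and the tag-based matching are hypothetical simplifications.

```python
from collections import defaultdict, deque

class OptimisticReceiver:
    def __init__(self):
        self.unexpected = defaultdict(deque)  # tag -> payloads held in temp buffers
        self.posted = defaultdict(deque)      # tag -> user buffers awaiting data
        self.extra_copies = 0

    def on_arrival(self, tag, payload):
        if self.posted[tag]:
            # A matching receive is already posted: deliver into user space.
            self.posted[tag].popleft()[:] = payload
        else:
            # No receive yet: the payload must be held in a temp buffer.
            self.unexpected[tag].append(bytes(payload))

    def recv(self, tag, user_buf):
        """Return True if data was already available, False if still pending."""
        if self.unexpected[tag]:
            user_buf[:] = self.unexpected[tag].popleft()  # temp-to-user copy
            self.extra_copies += 1
            return True
        self.posted[tag].append(user_buf)
        return False
```

Every message that arrives before its receive costs one extra copy, and `unexpected` grows without bound if senders run far ahead of the receiver.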

Asynchronous Robust MP Protocol

Note: after handshake, send and recv buffer addresses are known, so data transfer can be performed with little overhead
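
A minimal sketch of such a handshake, assuming the sender first announces the message with a small send-ready request and transfers the data only after the receiver replies with the destination buffer. The class and method names are hypothetical, and the network is replaced by direct method calls.

```python
class RobustReceiver:
    def __init__(self):
        self.pending_requests = {}  # tag -> sender waiting for a recv-ready reply
        self.posted = {}            # tag -> user buffer of a posted receive

    def on_send_ready(self, tag, sender):
        if tag in self.posted:
            sender.on_recv_ready(tag, self.posted.pop(tag))
        else:
            self.pending_requests[tag] = sender  # small record only, no payload

    def post_recv(self, tag, user_buf):
        sender = self.pending_requests.pop(tag, None)
        if sender is None:
            self.posted[tag] = user_buf
        else:
            sender.on_recv_ready(tag, user_buf)

class RobustSender:
    def __init__(self):
        self.outgoing = {}  # tag -> payload, kept at the sender until matched

    def send(self, tag, payload, receiver):
        self.outgoing[tag] = payload
        receiver.on_send_ready(tag, self)

    def on_recv_ready(self, tag, dest_buf):
        # Both buffer addresses are now known: copy straight into user space.
        dest_buf[:] = self.outgoing.pop(tag)
```

In this sketch the receiver never buffers payloads, only one small record per unmatched request, at the price of an extra round trip before the data moves.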

Active Messages
  • User-level analog of network transactions
    • transfer data packet and invoke handler to extract it from network and integrate with on-going computation

[Figure: a Request message and its handler; a Reply message and its handler]
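
The request/reply pattern can be sketched as below. This is a toy model under stated assumptions: each message names a handler that runs on arrival, and the network is replaced by an immediate method call. The `Node` class and the handler names are hypothetical.

```python
class Node:
    def __init__(self, node_id, network):
        self.node_id = node_id
        self.network = network   # dict: node id -> Node (stands in for the network)
        self.memory = {}
        self.results = {}
        network[node_id] = self

    def send(self, dest, handler_name, *args):
        # Deliver the message and invoke the named handler at the destination.
        handler = getattr(self.network[dest], handler_name)
        handler(self.node_id, *args)

    def read_request(self, src, key):
        # Request handler: extract the request and answer with a reply message.
        self.send(src, "read_reply", key, self.memory.get(key))

    def read_reply(self, src, key, value):
        # Reply handler: integrate the data into the ongoing computation.
        self.results[key] = value
```

For instance, after `b.memory["x"] = 42`, calling `a.send(1, "read_request", "x")` on node 0 ends with `a.results["x"]` holding the value, with no receive call posted on either side.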

Challenges Common to SAS and MP
  • Input buffer overflow: how to signal buffer space is exhausted

Solutions:

    • ACK at protocol level
    • back pressure flow control
    • special ACK path or drop packets (requires time-out)
  • Fetch deadlock (revisited): a request often generates a response that can form dependence cycles in the network

Solutions:

    • two logically independent request/response networks
    • NACK requests at receiver to free space
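
One of the options above, rejecting a packet with a NACK when the input buffer is full and letting the sender retry, might look like the following sketch. The names, the retry loop, and the `on_nack` callback are hypothetical; real hardware would back off or use a separate reply path rather than call into the receiver.

```python
from collections import deque

class InputBuffer:
    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = deque()

    def offer(self, packet):
        """Return 'ACK' if the packet was accepted, 'NACK' if the buffer is full."""
        if len(self.slots) >= self.capacity:
            return "NACK"
        self.slots.append(packet)
        return "ACK"

    def drain(self):
        """Consume one packet, freeing a slot."""
        return self.slots.popleft() if self.slots else None

def send_with_retry(buf, packet, max_tries, on_nack):
    """Retry until accepted; return the number of attempts, or None on failure."""
    for attempt in range(1, max_tries + 1):
        if buf.offer(packet) == "ACK":
            return attempt
        on_nack()   # e.g. wait, or let the receiver make progress
    return None
```

The sender must keep the packet until it sees an ACK, which is why protocol-level acknowledgements also bound how much the sender has to buffer.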

Spectrum of Designs
  • None, physical bit stream
    • blind, physical DMA (nCUBE, iPSC, ...)
  • User/System
    • User-level port (CM-5, *T)
    • User-level handler (J-Machine, Monsoon, ...)
  • Remote virtual address
    • Processing, translation (Paragon, Meiko CS-2)
  • Global physical address
    • Proc + Memory controller (RP3, BBN, T3D)
  • Cache-to-cache
    • Cache controller (Dash, KSR, Flash)

Increasing HW Support, Specialization, Intrusiveness, Performance (???)

MP Architectures

[Figure: node architecture. Each node has a processor (P), memory (M), and a communication assist (CA) attached to a scalable network. Output processing: checks, translation, formatting, scheduling. Input processing: checks, translation, buffering, action.]

Design tradeoff: how much processing in CA vs P, and how much interpretation of network transaction

  • Physical DMA (7.3)
  • User-level access (7.4)
  • Dedicated message processing (7.5)

Physical DMA

Example: nCUBE/2, IBM SP1

  • Node processor packages messages in user/system mode
  • DMA used to copy between network and system buffers
  • Problem: the hardware cannot distinguish user messages from system messages, so the node processor must be involved in every transfer, which adds substantial overhead

User-Level Access

Example: CM-5

  • Network interface mapped into user address space
  • Communication assist does protection checks, translation, etc.

No intervention by kernel except for interrupts

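Dedicated Message Processing

[Figure: two nodes attached to the network, each with memory (Mem), a network interface (NI), a processor (P), and a message processor (MP); labels: User, System, dest]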

The message processor (MP):

  • Interprets messages
  • Supports message operations
  • Off-loads P by giving it a clean message abstraction

Issues:

  • P/MP communicate via shared memory: coherence traffic
  • MP can be a bottleneck due to all concurrent actions

Shared Physical Address Space

[Figure: two nodes attached to a scalable network, each with a processor (P), memory (M), a pseudo memory, and a pseudo processor]

  • Remote read/write performed by pseudo processors
  • Cache coherence issues treated in Ch. 8
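
A toy model of remote access in a shared physical address space. It assumes that the upper address bits select the home node, and that a remote access is forwarded to a pseudo processor on the home node, which performs it on local memory. The address layout, class, and method names are hypothetical, and caching is ignored.

```python
NODE_SHIFT = 20                      # hypothetical: 1 MiB of memory per node
OFFSET_MASK = (1 << NODE_SHIFT) - 1

class SasNode:
    def __init__(self, node_id, nodes):
        self.node_id = node_id
        self.nodes = nodes           # dict: node id -> SasNode (stands in for the network)
        self.memory = {}             # offset -> value
        nodes[node_id] = self

    def load(self, addr):
        home, offset = addr >> NODE_SHIFT, addr & OFFSET_MASK
        if home == self.node_id:
            return self.memory.get(offset, 0)
        # Remote address: forward a read request to the home node.
        return self.nodes[home].pseudo_processor_read(offset)

    def store(self, addr, value):
        home, offset = addr >> NODE_SHIFT, addr & OFFSET_MASK
        if home == self.node_id:
            self.memory[offset] = value
        else:
            self.nodes[home].pseudo_processor_write(offset, value)

    def pseudo_processor_read(self, offset):
        # Acts on behalf of the remote processor against local memory.
        return self.memory.get(offset, 0)

    def pseudo_processor_write(self, offset, value):
        self.memory[offset] = value
```

The processor issues the same load or store in both cases; only the address decides whether the access stays local or becomes a network transaction.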
