
Computer Architecture Fall 2006
Lecture 29: Network Connected Multiprocessors

Adapted from Mary Jane Irwin (www.cse.psu.edu/~mji)

[Adapted from Computer Organization and Design, Patterson & Hennessy, © 2005]


Review: Bus Connected SMPs (UMAs)

  • Caches are used to reduce latency and to lower bus traffic

  • Must provide hardware for cache coherence and process synchronization

  • Bus traffic and bandwidth limit scalability (< ~36 processors)

[Diagram: four Processors, each with its own Cache, connected by a Single Bus to Memory and I/O]


Review: Multiprocessor Basics

  • Q1 – How do they share data?

  • Q2 – How do they coordinate?

  • Q3 – How scalable is the architecture? How many processors?



Network Connected Multiprocessors

[Diagram: each Processor has its own Cache and Memory; the processor/memory nodes are joined by an Interconnection Network (IN)]

  • Either a single address space (NUMA and ccNUMA) with implicit processor communication via loads and stores, or multiple private memories with message-passing communication via sends and receives (the two styles are contrasted in the sketch below)

    • Interconnection network supports interprocessor communication
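The difference shows up directly in code. Below is a minimal sketch contrasting the two styles, written in the same C-like pseudocode used later in this lecture; the shared variables x and flag and the send/receive primitives are illustrative assumptions, not any particular machine's API.

/* Shared address space: communicate with ordinary loads and stores */
x = 10;               /* producer stores into shared location x         */
flag = 1;             /* ...then sets a shared flag                     */
while (flag == 0) ;   /* consumer spins on the shared flag...           */
y = x;                /* ...then simply loads the value                 */

/* Message passing: communicate with explicit sends and receives */
send(1, 10);          /* processor 0 sends the value 10 to processor 1  */
y = receive();        /* processor 1 blocks until the value arrives     */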


Summing 100,000 Numbers on 100 Processors

  • Start by distributing 1000 elements of vector A to each of the local memories and summing each subset in parallel

sum = 0;
for (i = 0; i < 1000; i = i + 1)
  sum = sum + Al[i];        /* sum local array subset */

  • The processors then coordinate in adding together the partial sums (Pn is the number of the processor, send(x,y) sends value y to processor x, and receive() receives a value); a sequential C simulation of this reduction follows the code

half = 100;
limit = 100;
repeat
  half = (half+1)/2;                              /* dividing line */
  if (Pn >= half && Pn < limit) send(Pn-half, sum);
  if (Pn < (limit/2)) sum = sum + receive();
  limit = half;
until (half == 1);                                /* final sum in P0's sum */
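The pseudocode above runs one copy per processor. As a sanity check, the sequential C sketch below simulates the same half/limit reduction on one machine; the NPROCS and ELEMS constants and the all-ones data are illustrative assumptions.

#include <stdio.h>

#define NPROCS 100                 /* simulated processors            */
#define ELEMS  1000                /* elements per local memory       */

int main(void) {
    long sum[NPROCS];              /* one partial sum per "processor" */

    /* Phase 1: each processor sums its local subset (all ones here,
       so the expected total is NPROCS * ELEMS = 100,000). */
    for (int p = 0; p < NPROCS; p++) {
        sum[p] = 0;
        for (int i = 0; i < ELEMS; i++)
            sum[p] += 1;
    }

    /* Phase 2: the half/limit reduction.  A send(Pn-half, sum) from
       processor Pn becomes an addition into sum[Pn-half]. */
    int half = NPROCS, limit = NPROCS;
    do {
        half = (half + 1) / 2;                 /* dividing line      */
        for (int pn = half; pn < limit; pn++)
            sum[pn - half] += sum[pn];         /* send + receive     */
        limit = half;
    } while (half != 1);

    printf("total = %ld\n", sum[0]);           /* final sum in P0    */
    return 0;
}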



An Example with 10 Processors

[Diagram: ten processors P0–P9, each holding a partial sum; half = 10 initially]
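Tracing the reduction for 10 processors: half = (10+1)/2 = 5, so P5–P9 send to P0–P4; then half = 3, so P3 and P4 send to P0 and P1 while P2 simply keeps its sum; then half = 2, so P2 sends to P0; finally half = 1, so P1 sends to P0 and the loop exits with the total in P0's sum.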


Communication in Network Connected Multi’s

  • Implicit communication via loads and stores

    • hardware designers have to provide coherent caches and process synchronization primitives

    • lower communication overhead

    • harder to overlap computation with communication

    • more efficient to access remote data by address only when it is demanded, rather than to send it ahead of time in case it might be used (such a machine has distributed shared memory (DSM))

  • Explicit communication via sends and receives

    • simplest solution for hardware designers

    • higher communication overhead

    • easier to overlap computation with communication

    • easier for the programmer to optimize communication


Cache Coherency in NUMAs

  • For performance reasons we want to allow the shared data to be stored in caches

  • Once again we have multiple copies of the same data, with the same address, in different processors' caches

    • bus snooping won’t work, since there is no single bus on which all memory references are broadcast

  • Directory-based protocols (a sketch of a directory entry follows this list)

    • keep a directory that is a repository for the state of every block in main memory (which caches have copies, whether it is dirty, etc.)

    • directory entries can be distributed (sharing status of a block always in a single known location) to reduce contention

    • directory controller sends explicit commands over the IN to each processor that has a copy of the data
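A minimal sketch of what one directory entry might hold is shown below; the struct layout, the three states, and the bit-vector sharer list are illustrative assumptions, not the protocol of any particular machine.

#include <stdint.h>

/* One directory entry per block of main memory (illustrative). */
enum block_state { UNCACHED, SHARED, MODIFIED };

struct dir_entry {
    enum block_state state;  /* UNCACHED, SHARED (clean copies), MODIFIED (one dirty copy) */
    uint64_t sharers;        /* bit p set if processor p's cache holds a copy of the block */
};

/* On a write miss, the home node's directory controller would send an
   invalidate over the IN to every processor whose bit is set in sharers,
   collect the acknowledgements, and then grant ownership to the requester. */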


IN Performance Metrics

  • Network cost

    • number of switches

    • number of (bidirectional) links on a switch to connect to the network (plus one link to connect to the processor)

    • width in bits per link, length of link

  • Network bandwidth (NB) – represents the best case

    • bandwidth of each link * number of links

  • Bisection bandwidth (BB) – represents the worst case

    • divide the machine into two parts, each with half the nodes, and sum the bandwidth of the links that cross the dividing line

  • Other IN performance issues

    • latency on an unloaded network to send and receive messages

    • throughput – maximum # of messages transmitted per unit time

    • worst-case number of routing hops, congestion control, and delay
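For example, in an 8-node ring with per-link bandwidth B there are 8 links, so NB = 8B, while any cut that splits the ring into two 4-node halves severs exactly 2 links, so BB = 2B (these match the Ring IN slide's formulas with N = 8).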


Bus IN

  • N processors, 1 switch, 1 link (the bus)

  • Only 1 simultaneous transfer at a time

    • NB = link (bus) bandwidth * 1

    • BB = link (bus) bandwidth * 1

[Diagram: processor nodes attached to a single bus; legend: bidirectional network switch, processor node]


Ring IN

  • If a link is as fast as a bus, the ring is only twice as fast as a bus in the worst case, but is N times faster in the best case

  • N processors, N switches, 2 links/switch, N links

  • N simultaneous transfers

    • NB = link bandwidth * N

    • BB = link bandwidth * 2
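The worst-case/best-case claim above follows directly from these formulas: the ring's bisection crosses 2 links versus the bus's 1, and its N simultaneous transfers give N times the bandwidth of the bus's single transfer.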


Fully Connected IN

  • N processors, N switches, N-1 links/switch, (N*(N-1))/2 links

  • N simultaneous transfers

    • NB = link bandwidth * (N*(N-1))/2

    • BB = link bandwidth * (N/2)^2


Crossbar (Xbar) Connected IN

  • N processors, N^2 switches (unidirectional), 2 links/switch, N^2 links

  • N simultaneous transfers

    • NB = link bandwidth * N

    • BB = link bandwidth * N/2



Hypercube (Binary N-cube) Connected IN

[Diagram: 2-cube and 3-cube topologies]

  • N processors, N switches, logN links/switch, (NlogN)/2 links

  • N simultaneous transfers

    • NB = link bandwidth * (NlogN)/2

    • BB = link bandwidth * N/2
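For the 3-cube shown, N = 8: each switch has log2(8) = 3 links, there are (8 * 3)/2 = 12 links in total, so NB = 12 * link bandwidth and BB = 8/2 = 4 * link bandwidth.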



2D and 3D Mesh/Torus Connected IN

  • N simultaneous transfers

    • NB = link bandwidth * 4N or link bandwidth * 6N

    • BB = link bandwidth * 2 N^(1/2) or link bandwidth * 2 N^(2/3)

  • N processors, N switches, 2, 3, 4 (2D torus) or 6 (3D torus) links/switch, 4N/2 links or 6N/2 links


Fat Tree

  • N processors, log(N-1)*logN switches, 2 up + 4 down = 6 links/switch, N*logN links

  • N simultaneous transfers

    • NB = link bandwidth * NlogN

    • BB = link bandwidth * 4


Fat Tree

  • Trees are good structures. People in CS use them all the time. Suppose we wanted to make a tree network.

[Diagram: a tree network whose leaves are nodes A, B, C, and D]

  • Any time A wants to send to C, it ties up the upper links, so that B can't send to D.

    • The bisection bandwidth on a tree is horrible - 1 link, at all times

  • The solution is to 'thicken' the upper links.

    • Adding more links as the tree thickens toward the root increases the bisection bandwidth

  • Rather than designing a bunch of wide, N-port switches, use pairs of smaller switches


SGI NUMAlink Fat Tree

www.embedded-computing.com/articles/woodacre


IN Comparison

  • For a 64-processor system (the sketch below evaluates the slide formulas at N = 64)
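As a rough stand-in for the comparison, the C sketch below plugs N = 64 into the NB and BB formulas exactly as given on the preceding slides, in units of one link's bandwidth; the program itself, including the topology list and output format, is my own construction.

#include <stdio.h>
#include <math.h>

/* Evaluate the slide formulas for network bandwidth (NB) and bisection
   bandwidth (BB) at N = 64, in units of one link's bandwidth. */
int main(void) {
    const double N = 64.0;
    const double logN = log2(N);   /* 6 for N = 64 */

    printf("%-16s %8s %8s\n", "topology", "NB", "BB");
    printf("%-16s %8.0f %8.0f\n", "bus",             1.0,             1.0);
    printf("%-16s %8.0f %8.0f\n", "ring",            N,               2.0);
    printf("%-16s %8.0f %8.0f\n", "crossbar",        N,               N / 2);
    printf("%-16s %8.0f %8.0f\n", "hypercube",       N * logN / 2,    N / 2);
    printf("%-16s %8.0f %8.0f\n", "2D torus",        4 * N,           2 * sqrt(N));
    printf("%-16s %8.0f %8.0f\n", "fat tree",        N * logN,        4.0);
    printf("%-16s %8.0f %8.0f\n", "fully connected", N * (N - 1) / 2, (N / 2) * (N / 2));
    return 0;
}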




A BlueGene/L Chip

[Diagram: one BlueGene/L compute chip. Two 700 MHz PowerPC 440 CPUs, each with a double FPU, 32K/32K L1 caches, and a 2KB L2, share a 4MB ECC eDRAM L3 (128B lines, 8-way associative) and a 16KB multiport SRAM buffer; a DDR controller connects to 256MB of 144b DDR memory at 5.5 GB/s; the chip also has interfaces for Gbit Ethernet, the 3D torus (6 in, 6 out, 1.6 GHz, 1.4 Gb/s links), the fat tree (3 in, 3 out, 350 MHz, 2.8 Gb/s links), and 4 global barriers]


Networks of Workstations (NOWs) Clusters

  • Clusters of off-the-shelf, whole computers with multiple private address spaces

  • Clusters are connected using the I/O bus of the computers

    • lower bandwidth than multiprocessors that use the memory bus

    • lower speed network links

    • more conflicts with I/O traffic

  • Clusters of N processors have N copies of the OS, limiting the memory available for applications

  • Improved system availability and expandability

    • easier to replace a machine without bringing down the whole system

    • allows rapid, incremental expandability

  • Economy-of-scale advantages with respect to costs



Summary

  • Flynn’s classification of processors - SISD, SIMD, MIMD

    • Q1 – How do processors share data?

    • Q2 – How do processors coordinate their activity?

    • Q3 – How scalable is the architecture (what is the maximum number of processors)?

  • Shared address multis – UMAs and NUMAs

    • Scalability of bus connected UMAs limited (< ~ 36 processors)

    • Network connected NUMAs more scalable

    • Interconnection Networks (INs)

      • fully connected, xbar

      • ring

      • mesh

      • n-cube, fat tree

  • Message passing multis

  • Cluster connected (NOWs) multis


Next Lecture and Reminders

  • Reminders

    • Final: Tuesday, December 12, 8:00 – 9:50 AM

  • Next lecture

