Debunking then Duplicating Ultracomputer Performance Claims by Debugging the Combining Switches

Eric Freudenthal and Allan Gottlieb

{freudenthal, [email protected]


Talk Summary

  • Review Ultracomputer combining networks

    • MIMD architecture expected to provide high performance for hot spot traffic & centralized coordination

  • Duplicating & debunking

    • High hot spot latency, slow centralized coordination

    • Why?

  • Minor improvements to architecture

    • Significantly reduced hot spot latency

    • Improved coordination performance


2³ PE Computer with Omega Network

[Figure: an 8-processor (2³ PE) machine. Processing elements PE0–PE7 connect through three stages of 2×2 switches (SW) to memory modules MM0–MM7; each stage routes on one address bit (2⁰, 2¹, 2²). The NUMA connections permit either a "dance hall" layout (all processors equally distant from all memory modules) or a "boudoir" layout (processors and memory modules co-resident).]


Network Congestion Due to Polling of a Single Variable in MM3

[Figure: the same 8-PE omega network; every PE issues a reference to one variable in MM3, and the switch queues on the "funnel" of paths converging on MM3 (shown in red) fill.]

  • Each PE has a single outstanding reference to the same variable.

    • Low offered load

  • These references serialize at MM3

    • Switch queues in “funnel” near MM3 (in red) fill

    • High memory latency results

  • If switches could “combine” references to a single variable

    • A single MM operation would satisfy multiple requests

    • Lower network congestion & latency

    • NYU Ultracomputer does this


Fetch-and-add

FAA(X, e)

  • Atomic operation

  • Fetches old value of X and adds e to X

  • Useful for busy-waiting coordination

  • Ultracomputer switches combine FAAs

  • FAA(X,0) is equivalent to load X (see the sketch below)
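
As a concrete illustration, a minimal C11 sketch of these semantics; atomic_fetch_add is the standard library's fetch-and-add, and the variable name and addends are illustrative, not from the talk:

#include <stdatomic.h>
#include <stdio.h>

static atomic_int X = 12;

int main(void) {
    int old = atomic_fetch_add(&X, 3);  /* FAA(X,3): returns 12, X becomes 15 */
    int cur = atomic_fetch_add(&X, 0);  /* FAA(X,0) acts as load X: returns 15 */
    printf("old=%d cur=%d\n", old, cur);
    return 0;
}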


Combining of Fetch & Add (and loads)

[Figure: four concurrent requests FAA(X,1), FAA(X,2), FAA(X,4), FAA(X,8), with X initially 0. Switches combine them pairwise into FAA(X,3) and FAA(X,12), then into FAA(X,15); each combining switch records the first-serialized addend and its arrival port in a "wait buffer" entry (e.g., lower port first, its addend = 12). The memory performs a single operation, returning 0 and setting X = 15, and the switches decombine the reply on its way back: the four requesters receive 0, 4, 12, and 13. Semantics are equivalent to some serialization of the FAAs: start X = 0, end X = 15.]

(A software sketch of combining and decombining follows.)
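
A hedged software model of the combine/decombine step (the real mechanism is switch hardware; the function names, and the convention that the lower port is serialized first, are illustrative assumptions):

#include <stdio.h>

typedef struct { int addr, first_addend; } wait_entry;

/* Combine two FAAs to the same address into one; returns the merged addend.
 * The lower-port request is serialized first, so its addend is saved for
 * decombining. */
int combine(wait_entry *w, int addr, int upper_addend, int lower_addend) {
    w->addr = addr;
    w->first_addend = lower_addend;
    return upper_addend + lower_addend;
}

/* Decombine the memory's reply v: the first-serialized requester receives v,
 * the other sees v plus the first's addend, as if the FAAs ran in order. */
void decombine(const wait_entry *w, int v, int *lower_result, int *upper_result) {
    *lower_result = v;
    *upper_result = v + w->first_addend;
}

int main(void) {
    wait_entry w;
    int merged = combine(&w, /*addr X*/ 0, /*upper*/ 3, /*lower*/ 12);
    /* memory performs FAA(X,15): X was 0, reply v = 0, X becomes 15 */
    int lo, up;
    decombine(&w, /*v=*/ 0, &lo, &up);
    printf("merged=%d lower gets %d, upper gets %d\n", merged, lo, up);
    return 0;
}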


Coordination with fetch-and-add

Spin-locks:

    shared int L = 1

    lock():
        while (faa(L,-1) < 1) {    // try to take the lock
            faa(L,+1)              // failed: undo the decrement
            while (L < 1) ;        // poll until the lock looks free
        }

    unlock():
        faa(L,+1)

Readers and Writers:

    constant int p = max readers
    shared int C = p               // p resource instances

    Reader() {                     // take 1 instance
        while (faa(C,-1) < 1) {
            faa(C,+1)
            while (C < 1) ;
        }
        read()
        faa(C,+1)
    }

    Writer() {                     // take all p instances
        while (faa(C,-p) < p) {
            faa(C,+p)
            while (C < p) ;
        }
        write()
        faa(C,+p)
    }

(A runnable C11 rendering follows.)
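
A runnable C11 rendering of the slide's algorithms, assuming faa maps onto atomic_fetch_add; the constant P = 64 and the default sequentially consistent atomics are illustrative choices, not from the talk:

#include <stdatomic.h>

static atomic_int L = 1;               /* spin-lock: 1 means free */

void lock(void) {
    while (atomic_fetch_add(&L, -1) < 1) {   /* try to take the lock    */
        atomic_fetch_add(&L, +1);            /* failed: undo decrement  */
        while (atomic_load(&L) < 1) ;        /* poll until it looks free */
    }
}
void unlock(void) { atomic_fetch_add(&L, +1); }

#define P 64                           /* max concurrent readers */
static atomic_int C = P;               /* P resource instances   */

void reader_lock(void) {               /* take one instance */
    while (atomic_fetch_add(&C, -1) < 1) {
        atomic_fetch_add(&C, +1);
        while (atomic_load(&C) < 1) ;
    }
}
void reader_unlock(void) { atomic_fetch_add(&C, +1); }

void writer_lock(void) {               /* take all P instances */
    while (atomic_fetch_add(&C, -P) < P) {
        atomic_fetch_add(&C, +P);
        while (atomic_load(&C) < P) ;
    }
}
void writer_unlock(void) { atomic_fetch_add(&C, +P); }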


Characteristics of FAA Centralized Coordination Algorithms

  • Many FAA coordination algorithms reference a small number of shared variables.

  • Spin-locks and reader/writer locks reference one

  • An uncontended spin- or reader/writer-lock acquisition generates one shared access

    • Including multiple readers in the absence of writers

  • FAA barrier and queue algorithms have similar characteristics


Combining Queue Design

[Figure, two panels: "Background: Guibas & Liang Systolic FIFO" and "Ultracomputer Combining Queue". Each queue is a pipeline of slices with "in"/"out" paths and per-slice "chutes"; the combining queue adds an ALU per slice so that messages to the same address can be merged as they pass one another. Because messages are compared only pairwise as they pass, no associative memory is required.]

(A software sketch of this neighbor-only comparison follows.)
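
A hedged software sketch of the idea that combining needs no associative search (the hardware compares messages as they pass through queue slices; this model simplifies to comparing only against the current tail, and all names are illustrative):

#include <stdbool.h>

typedef struct { int addr, addend; } msg;

#define QCAP 8
typedef struct {
    msg slot[QCAP];
    int head, count;
} cqueue;

/* Returns true if m was combined into the tail message; the caller must then
 * record a wait-buffer entry so the reply can be decombined later. */
bool enqueue_combining(cqueue *q, msg m) {
    if (q->count > 0) {
        msg *tail = &q->slot[(q->head + q->count - 1) % QCAP];
        if (tail->addr == m.addr) {   /* adjacent match: combine (ALU add) */
            tail->addend += m.addend;
            return true;
        }
    }
    q->slot[(q->head + q->count) % QCAP] = m;  /* ordinary enqueue      */
    q->count++;                                 /* (assumes queue not full) */
    return false;
}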


Summary of Baseline Ultracomputer

  • Architecture is reasonable and well motivated

    • Switches not prohibitively expensive

    • Serialization-free coordination algorithms

  • Queues in switches permit high bandwidth

    • Low latency for random & mixed hot spot traffic

  • NYU simulations (surprisingly) did not include 100% hot spot traffic

    • (Lee, Kruskal & Kuck did, but with different flow control)

    • In fact, combining is helpful, but not as effective as expected

    • Queues near hot memory fill; others nearly empty

      • Non-trivial queuing delays

    • Combining only in full queues

      • Low message “multiplicity”


Rest of this talk

  • Debunking: High latency despite Ultra III flow control

    • Algorithms that minimize hot spot traffic outperform centralized ones

  • Deconstructing: Understanding the high latency

    • Reduced combining due to wait buffer exhaustion

    • Queuing delays in the network – reduced queue capacity helps

  • Debugging: Improvements to combining switches

    • Larger wait buffer needed

    • Adaptive reduction of queue capacity when combining occurs

  • Duplication: Centralized algorithms competitive

    • Much superior for concurrent-access locks


Ultra III "Baseline" Switches: Memory Latency, One Request per PE

[Plot: memory latency vs. offered load, curves labeled by percentage of hot spot traffic: 100% with no combining, 100%, 40%, 20%, 0–10%, and ideal. Annotations mark roughly 4× and 2× the ideal latency on the hotter curves.]


Two “Fixes” to Ultra III Switch Design

  • Problem: Full wait buffers reduce combining

    • "Sufficient" wait-buffer capacity → 45% latency reduction

  • Problem: Congestion in “combining funnel”

    • Shortened queues → backpressure

      • Lower per-stage queuing delays

      • More non-empty queues

        • More combining, hence higher message "multiplicity"

      • Reduces latency by another 30%

      • FAA algorithms now competitive


What is the "Best" Queue Length?

  • Problem

    • Non-hot-spot latency benefits from large queues

    • Hot-spot latency benefits from small queues

  • Solution

    • Detect switches engaged in combining

      • Multiple combined messages awaiting transmission

    • Adaptively reduce capacity of these switches

      • Other switches unaffected

  • Results

    • Reduced polling latency, good non-polling latency (a sketch of the adaptive rule follows)
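
A hedged sketch of the adaptive rule; the thresholds and field names are illustrative assumptions, not the paper's parameters:

#include <stdbool.h>

#define FULL_CAP    8   /* capacity for ordinary (non-hot-spot) traffic      */
#define REDUCED_CAP 2   /* capacity while combining is observed              */
#define COMBINE_MIN 2   /* "multiple combined messages awaiting transmission" */

typedef struct {
    int count;             /* messages currently queued          */
    int combined_waiting;  /* combined messages not yet transmitted */
} sw_queue;

/* A switch engaged in combining shrinks its effective capacity. */
static int effective_capacity(const sw_queue *q) {
    return (q->combined_waiting >= COMBINE_MIN) ? REDUCED_CAP : FULL_CAP;
}

/* Accept a message only if the adaptive capacity allows it; otherwise
 * assert backpressure toward the upstream stage, keeping upstream queues
 * non-empty so they combine more. */
static bool accept(sw_queue *q) {
    if (q->count >= effective_capacity(q))
        return false;      /* backpressure */
    q->count++;
    return true;
}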


Memory Latency, 1024 PE Systems, Over a Range of Accepted Load

  • Baseline Ultra III switch

    • Limited wait buffer

    • Fixed queue size

  • Waitbuf100

    • Baseline

    • Sufficient wait buffer

  • Improved

    • Waitbuf100

    • Adaptive queue length

  • Aggressive

    • Improved

    • Combines from both ports & on first slice

    • Potential clock rate reduction

[Plots, three panels: 100% hot spot, 20% hot spot, and uniform traffic.]


Mellor-Crummey & Scott (MCS): Local-Spin Coordination

  • No hot spot polling

    • Each PE spins on a distinct shared variable in a co-located MM (see the MCS lock sketch after this list)

    • Other parts of the algorithm may generate hot spot traffic

  • Serialization-free barriers

    • Barrier satisfaction “disseminated” without generating hotspot traffic

    • Each processor has log2(N) rendezvous

  • Locks: Global state in hot spot variables

    • Heads of linked lists (blocked requestors)

    • Count of readers

    • Hot spot accesses benefit from combining
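
For concreteness, a textbook C11 sketch of the MCS list-based queuing lock (not necessarily the exact variant measured here); each waiter spins only on its own node's flag, which can live in its co-located memory module:

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct mcs_node {
    _Atomic(struct mcs_node *) next;
    atomic_bool locked;
} mcs_node;

typedef _Atomic(mcs_node *) mcs_lock;  /* tail of the waiter list; init NULL */

void mcs_acquire(mcs_lock *lk, mcs_node *me) {
    atomic_store(&me->next, NULL);
    atomic_store(&me->locked, true);
    mcs_node *pred = atomic_exchange(lk, me);  /* one hot-spot access: a swap */
    if (pred) {                                /* queue was non-empty         */
        atomic_store(&pred->next, me);
        while (atomic_load(&me->locked)) ;     /* local spin on own flag      */
    }
}

void mcs_release(mcs_lock *lk, mcs_node *me) {
    mcs_node *succ = atomic_load(&me->next);
    if (!succ) {
        mcs_node *expected = me;
        if (atomic_compare_exchange_strong(lk, &expected, NULL))
            return;                                /* no waiter: lock freed   */
        while (!(succ = atomic_load(&me->next))) ; /* waiter mid-enqueue      */
    }
    atomic_store(&succ->locked, false);        /* hand off via succ's own flag */
}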


Synchronization: Barriers (MCS also serialization-free)

Two microbenchmark loops (plots: higher is better):

  • IntenseLoop: barrier only

  • RealisticLoop: reference 15 or 30 shared variables, then barrier

(A sketch of a centralized FAA barrier follows.)
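
A hedged C11 sketch of a centralized FAA barrier with sense reversal, one plausible form of the FAA barriers being compared; N and the sense-reversal detail are illustrative, not necessarily the authors' exact algorithm:

#include <stdatomic.h>
#include <stdbool.h>

#define N 1024                         /* number of PEs */
static atomic_int  count = N;
static atomic_bool sense = false;

/* Each PE keeps a local my_sense, initially false.  Every arrival is one
 * fetch-and-add on a single hot-spot counter, so with combining all N
 * arrivals can be satisfied by a few memory operations. */
void barrier(bool *my_sense) {
    *my_sense = !*my_sense;            /* flip local sense for this episode */
    if (atomic_fetch_add(&count, -1) == 1) {   /* last PE to arrive         */
        atomic_store(&count, N);               /* reset for the next episode */
        atomic_store(&sense, *my_sense);       /* release all waiters        */
    } else {
        while (atomic_load(&sense) != *my_sense) ;  /* poll hot-spot flag   */
    }
}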


Reader-Writer Experiment

  • Loop:

    Determine if reader or writer

    “Sleep” for 100 cycles

    Lock

    Reference 10 shared variables

    Unlock

  • Reader-writer mix

    • All reader, all writer

    • 1 expected writer

      • P(writer) = 1/N

  • Plots on next slides

    • Rate at which reader and writer locks are granted (unit = grants per kilocycle)

    • Greater values indicate greater progress


All Readers / All Writers

All Readers

  • Combining helps MCS

  • The serialization-free FAA algorithm is faster

All Writers

  • Essentially a highly contended semaphore

  • Only the aggressive design competes

[Plots: higher is better.]


1 Expected Writer

Reader performance

  • FAA faster

  • MCS benefits from combining

Writer performance

  • FAA generally faster

  • MCS benefits from combining

[Plots: higher is better.]


Conclusions

  • “Improved” architecture superior

    • Large wait buffers decrease hot spot latency

    • Adaptive queue capacity decreases latency

      • General technique?

  • Performance of FAA Algorithms

    • Reader/writer lock competitive with MCS

      • Much superior when readers dominate

      • Requires combining

    • Barrier performance near MCS

      • Faster with the aggressive design


Relevance & Future Work

  • Large shared memory systems are manufactured

    • Combining not restricted to omega network

      • Return messages must be routed to combine sites

    • Combining demonstrated as useful for inter-process coordination.

  • Application of adaptive queue-capacity modulation to other domains

    • Such as responding to flash-flood & DoS traffic

  • An analytic model of queuing delays for hot spot combining is under development


Difficulties with Aggressive (2-Input, Coupled) Queues

  • Single-input queues are simpler

    • A dual-input combining queue can be built from two single-input combining queues

    • But messages arriving on different ports are then ineligible for combining

  • Decoupled ALUs

    • Idea: remove the ALU from the transmission path

      • Shorter clock interval: max(transmission, ALU) instead of their sum

    • But the head item cannot combine, so combining is less likely

      • Requires ≥ 3 enqueued messages

[Figure: queue slices with per-slice ALUs and a mux.]


END

  • Questions?

