
A High-Performance Deadlock-Free Overlay for Wide-Area Parallel and Distributed Programming



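A High-Performance Deadlock-Free Overlay for Wide-Area Parallel and Distributed Programming. Taura Laboratory, 076426, Ken Hironaka. Parallel and distributed computing in wide-area environments. Opportunities to use multiple clusters for parallel computation are increasing: WAN bandwidth has grown, and environments that connect multiple clusters over a WAN have become common, such as Grid5000 (France), DAS-3 (the Netherlands), and InTrigger (Japan). Parallel and distributed computation is increasing: applications built on parallel libraries (combinatorial optimization problems, model checking) and data-intensive applications (parallelizing the analysis of large volumes of data). WAN. Cluster.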


Presentation Transcript


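A High-Performance Deadlock-Free Overlay for Wide-Area Parallel and Distributed Programming

Taura Laboratory, 076426, Ken Hironaka


Parallel and Distributed Computing in Wide-Area Environments

  • Opportunities to use multiple clusters for parallel computation are increasing

    • WAN bandwidth has grown

    • Environments that connect multiple clusters over a WAN have become common

      • Grid5000 (France), DAS-3 (the Netherlands), InTrigger (Japan)

  • Parallel and distributed computation is increasing

    • Applications built on parallel libraries: combinatorial optimization problems, model checking

    • Data-intensive applications: parallelizing the analysis of large volumes of data

[Figure: clusters connected over a WAN]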



NAT/firewall

  • WAN

    • NAT, Firewall

    • WAN/LAN

Too many

Contention!



    • UDP/TCP

Cluster

firewall Cluster



  • WAN

    • 47291



    • MPI,

  • /

    • (JavaRMI), (MapReduce, Make)

    • P2P (DHT, )

Parallel and Distributed Applications

Programming Languages, Libraries, Frameworks, Middleware

Application-level Overlay

LAN/WAN, NAT, firewall, scalability



    • Latency: 100 [us] ~ 100 [ms]

    • Bandwidth: 10 [Mbps] ~ 10 [Gbps]

[Figure: a 10 [Mbps] narrow link between 10 [Gbps] links]



1:

[Figure: src and dst connected through 1Gbps links and a WAN link]



2:

[Figure: a packet traveling from src to dst reaches a buffer that is FULL]



  • 4

    • Link A → Link B

    • Link B → Link C

    • Link C → Link D

    • Link D → Link A

[Figure: Links A, B, C, and D arranged in a cycle]
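
The four-link example above describes a circular dependency: following the pairs A, B, C, D leads back to A. A minimal Python sketch of detecting such a cycle, assuming each listed pair means the first link is waiting on the second. The function name `find_cycle` and the dictionary layout are illustrative and do not come from the slides.

```python
def find_cycle(waits_for):
    """Return one cycle in a waits-for graph as a list of links, or None."""
    nodes = set(waits_for)
    for targets in waits_for.values():
        nodes.update(targets)
    state = {n: "unvisited" for n in nodes}
    path = []  # links on the current depth-first search path

    def visit(link):
        state[link] = "on_path"
        path.append(link)
        for nxt in waits_for.get(link, ()):
            if state[nxt] == "on_path":
                # Reached a link already on the path: the path closes a cycle.
                return path[path.index(nxt):] + [nxt]
            if state[nxt] == "unvisited":
                cycle = visit(nxt)
                if cycle:
                    return cycle
        path.pop()
        state[link] = "done"
        return None

    for link in sorted(nodes):
        if state[link] == "unvisited":
            cycle = visit(link)
            if cycle:
                return cycle
    return None


# The four links from the slide (assumed direction: first waits on second).
ring = {"A": ["B"], "B": ["C"], "C": ["D"], "D": ["A"]}
# The same links with the last dependency removed.
chain = {"A": ["B"], "B": ["C"], "C": ["D"], "D": []}
```

Removing any single dependency from the ring, as in `chain`, leaves no cycle, which is the property a deadlock-free routing restriction aims to guarantee.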


[Figure: Links A, D, B, and C]



3:



  • RON (Resilient Overlay Network) [Andersen et al. 01]

    • UDP

    • UDP

  • DiskRouter[Kola et al. 03]



src

dst

  • UDP + End-End

  • [Kar et al. 01]

    • ACK

    • ACKpiggyback

  • Spines : [Amir et al. 02]

    • TCP

    • 1 GigabitEther 300[Mbps]

[Figure: UDP hops with feedback, compared with a TCP hop]



    • [Antonio et al. 94]

    • [Dally et al. 87, 93]

    • Up/Down Routing [Schroeder et al. 91]

    • Ordered-link Routing [Chiu et al. 02]

    • L-Turn Routing [Koibuchi et al. 01]



    • TCP



    • TCPTCP

    • srcdstFIFO

    • TCP

[Figure: packets relayed from src to dst over one TCP connection per hop; an intermediate buffer is FULL]
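
Most of the text on this slide was lost, so the following is only a sketch of the general idea the figure suggests: when each hop uses a reliable, blocking channel, a full buffer stalls the sender instead of dropping packets, and FIFO order from src to dst is preserved. It uses bounded `queue.Queue` objects from the Python 3 standard library as stand-ins for per-hop TCP connections. The names `relay` and `run_chain`, and the use of `None` as an end-of-stream marker, are assumptions made for illustration.

```python
import queue
import threading


def relay(inbox, outbox):
    """Forward packets one hop; put() blocks while the next buffer is full."""
    while True:
        packet = inbox.get()
        outbox.put(packet)  # blocks here when outbox is full (backpressure)
        if packet is None:  # None marks end of stream (illustrative convention)
            return


def run_chain(packets, hops=3, buffer_size=2):
    """Send packets through a chain of relays and return what dst received."""
    links = [queue.Queue(maxsize=buffer_size) for _ in range(hops + 1)]
    relays = [threading.Thread(target=relay, args=(links[k], links[k + 1]))
              for k in range(hops)]
    for t in relays:
        t.start()

    received = []

    def sink():
        while True:
            packet = links[-1].get()
            if packet is None:
                return
            received.append(packet)

    consumer = threading.Thread(target=sink)
    consumer.start()
    for p in packets:
        links[0].put(p)  # src also blocks when the first buffer is full
    links[0].put(None)
    for t in relays:
        t.join()
    consumer.join()
    return received
```

With small buffers the source is throttled to the pace of the slowest hop, yet every packet arrives and arrives in order.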



Up/Down Routing [Schroeder et al. 91]

DOWN

    • ID

  • ID

    • ID

    • UP:

    • DOWN:

    • DOWN → UP

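[Figure: a link between nodes i and j (i > j) labeled UP in one direction and DOWN in the other; an example graph with nodes 0 to 6 whose links are labeled UP and DOWN; a Down → Up turn is marked]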
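
Up/Down routing [Schroeder et al. 91] allows a route to take zero or more UP links followed by zero or more DOWN links, so a DOWN to UP turn never occurs. A minimal Python sketch of that legality check follows. The direction convention (a hop from node i to a node with a smaller ID is UP) is an assumption, since the slide text that defined it was lost, and the function names are illustrative.

```python
def hop_direction(i, j):
    """Assumed convention: moving from node i to a smaller ID j is UP."""
    return "UP" if i > j else "DOWN"


def is_legal_updown_path(path):
    """A path is legal if no UP hop follows a DOWN hop."""
    gone_down = False
    for i, j in zip(path, path[1:]):
        if hop_direction(i, j) == "DOWN":
            gone_down = True
        elif gone_down:
            return False  # DOWN followed by UP is prohibited
    return True
```

For example, under this convention the path 5, 1, 0, 2, 6 climbs to node 0 and then descends, so it is legal, while 1, 3, 2 goes DOWN and then UP and is rejected.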



    • TCP



[Saito et al. 07]

    • [d^k, d^(k+1)), d

    • N, N log N

[Figure: distance ranges d, d^2, d^3]



Up/Down

  • ID

  • UPDOWN

  • ID

    • UPWAN

    • UPDOWN

      WAN

[Figure: nodes 0 to 5 grouped into a cluster, with links labeled UP and DOWN]



Up/Down

0

  • ID

  • Rationale

    • UPDOWN

    • UP or DOWN

[Figure: nodes 1, 3, 4, and 5 with links labeled UP and DOWN]


[Figure: two ID assignments over the same clusters (nodes 0 to 5, links labeled UP and DOWN): locality-aware DFS-updown and BFS-updown]


[Figure: src to dst through buffers B1, B2, and B3]



(1/2)

  • TCP

    • 1 packet

    • 1packet

Send buffer

FULL!

Recv buffer



(2/2)

    • packet





Deadlock-free

  • Deadlock

    • ordered-link

    • Up/Down

    • Up/Down

    • 13 (515)



Average Hops

Max. Hops

Deadlock

Deadlock

  • Up/Down



  • deadlock-freeWAN

  • WAN



Deadlock-free

  • 7

    (170)

    • : 9%

  • Deadlock-free



vs.

Up/Down

LANWAN

Up/Down

  • Up/DownWAN



    • 1 Gigabit Ethernet LAN (940 [Mbps])

    • Myrinet 10G LAN (7 [Gbps])



GbE (940[Mbps])

Myrinet (7[Gbps])

  • TCP

  • Myrinet: 4.5 [Gbps]



    • Gather, All-to-All

    • LAN: 1-switch (36nodes), (177 nodes)

    • WAN: 4 clusters (291 nodes)



Gather

1-switch (36)

  • switch:

    • Packet-loss

    • TCP:

    • 200 [ms] loss

  • :

TCP RTO: 200 [ms]



Gather

4 (291 )



All-to-All

      • 177

      • MPICH()

    • WAN4

      • 291

[Figure: four clusters connected by 1Gbps and 4Gbps links]



All-to-All

1 (177 )

4 (291 )

  • WAN



  • WAN

    • LAN/WAN



  • (1)

    • High Performance Wide-area Overlay using Deadlock-free Routing. High Performance Distributed Computing(HPDC), 2009

  • (2)

    • Vol.1 No.2 (ACS 23), pp.157-168, August 2008.

    • A Low-stretch Object Migration Scheme for Wide-area Environments. IPSJ Transactions on Programming. Vol.48 No.SIG 12 (PRO 34), pp.28-40, August 2007.

  • (2)

    • gluepy : A Simple Distributed Python Framework for Complex Grid Environments. At 21st Annual International Workshop on Languages and Compilers for Parallel Computing (LCPC2008). LNCS Vol.5335, pp.249-263, July 2008.

    • (SACSIS 2008), pp.349-358, May 2008.

  • (2)

    • TCP. OS-109 (SWoPP 2008), pp.9-15, August 2008.

    • OS-106 (SWoPP 2007), pp.71-78, August 2007.

  • (4), (3)



: gluepy

  • GridPython

    • glue

    • WAN

    • RMI

      (Remote Method Invocation)

      • (NAT/firewall)



Proc: A

Proc: B

    • RMI (Remote Method Invocation)

    • RMI

[Figure: Proc A calls a.f() on object a, and f() runs in Proc B via RMI; with async. RMI, Proc A issues a.f() to several Proc B instances and the f() calls run concurrently]



Objects in

computation

    • Object lookup

    • object

  • RMI

    • rollback

lookup

New object on joining node

Exception!

Object on failed node



(1)

  • TCP

NAT

Global IP

Firewall

Attempt connection

established connections



(2)

  • Firewall

    • port-forwarding

    • SSH

    • P-P

      • AODV [Perkins 97]

Firewall

traversal

SSH

#config file

use src_pat dst_pat, prot=ssh, user=kenny

P-to-P

communication



Programming in gluepy

inherit Remote Object

  • RemoteObject

    • Base class

    • RMI

  • futureRMI

    • placeholder

    • flow

class Peer(RemoteObject):        # inherit RemoteObject
    def run(self, arg):
        # work here
        return result

# async. RMI: call run() on all peers
futures = []
for p in peers:
    f = p.run.future(arg)
    futures.append(f)

# wait for all results
waitall(futures)

# read all results
for f in futures:
    print(f.get())
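
The same call, wait, read flow can be tried without gluepy. The sketch below is a stand-in that uses `concurrent.futures` from the Python 3 standard library, with local threads in place of remote objects. It does not use or reproduce the gluepy API; `Peer` here is a plain class and `run_on_all` is a hypothetical helper.

```python
from concurrent.futures import ThreadPoolExecutor, wait


class Peer:
    """Plain local object standing in for a remote peer."""

    def __init__(self, name):
        self.name = name

    def run(self, arg):
        return "%s handled %s" % (self.name, arg)


def run_on_all(peers, arg):
    """Start run() on every peer, wait for all, then read all results."""
    with ThreadPoolExecutor(max_workers=len(peers)) as pool:
        # Asynchronous calls: each submit() returns a future immediately.
        futures = [pool.submit(p.run, arg) for p in peers]
        # Wait for all results.
        wait(futures)
        # Read all results in the order the calls were issued.
        return [f.result() for f in futures]
```

As in the slide's code, the future acts as a placeholder for a result that is not yet available, which lets the caller keep a sequential control flow.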



SerialObject

waiting threads

owner

thread

object

  • SerialObjects

    • RemoteObjectsub-class

    • call: acquire

    • return: release

    • 1

    • e.g: waitall(), Serial Object

    • deadlock

[Figure: threads (Th) wait while one owner thread holds the object; when the owner blocks it gives up ownership, a new owner thread is chosen, and a thread that is unblocked re-contests for ownership]



SerialObject

      • Unix

      • Blockingunblock

    • contextblock1unblock

      • block

    • Unblock

[Figure: a SIGNAL sent to the object unblocks a blocked thread (Th), which then handles it]



SerialObjects in gluepy

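class DistQueue(SerialObject):
    def __init__(self):
        self.queue = []

    def add(self, x):
        # atomic section: only the owner thread runs inside the object
        self.queue.append(x)
        if len(self.queue) == 1:
            self.signal()        # signal and wake a thread blocked in pop()

    def pop(self):
        while len(self.queue) == 0:
            wait([])             # block until signal
        x = self.queue.pop(0)
        return x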

  • Atomic Section:

    • stateNon-SerialObjectatomic

  • Queue

    • queuepop()block

    • add()

      • signal()

        unblock

Atomic Section

Signal & wake

Block until signal
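
For readers without gluepy, the blocking-queue behavior can be approximated with a condition variable. The sketch below uses `threading.Condition` from the Python 3 standard library as a stand-in for SerialObject ownership and `signal()`. It is a local analogy, not the gluepy implementation, and the class name `LocalQueue` is illustrative.

```python
import threading


class LocalQueue:
    """Local blocking queue: pop() blocks until add() supplies an item."""

    def __init__(self):
        self._items = []
        self._cond = threading.Condition()

    def add(self, x):
        with self._cond:             # one thread at a time inside the object
            self._items.append(x)
            self._cond.notify()      # wake one thread blocked in pop()

    def pop(self):
        with self._cond:
            while not self._items:   # re-check after every wake-up
                self._cond.wait()    # releases the lock while blocked
            return self._items.pop(0)
```

One difference in this sketch: it notifies on every `add()`, whereas the slide's code signals only when the queue becomes non-empty.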



Master-worker in gluepy (1/3)

class Master(SerialObject):
    ...

    def nodeJoin(self, node):
        self.nodes.append(node)
        self.signal()                # signal for join

    def run(self):
        assigned = {}
        while True:
            while len(self.nodes) > 0 and len(self.jobs) > 0:
                # ASYNC. RMIS TO IDLE WORKERS (elided on the slide)
            readys = wait(futures)   # block; a join signal also wakes this
            if readys is None:
                continue
            for f in readys:
                # HANDLE RESULTS (shown on slide 2/3)

    • blocksignal

      Noneunblock

Signal for join

Block &

Handle join



Master-worker in gluepy (2/3)

for f in readys:
    node, job = assigned.pop(f)
    try:
        print("done: %s" % (f.get(),))
        self.nodes.append(node)      # the worker is idle again
    except RemoteException as e:
        self.jobs.append(job)        # failure handling: requeue the job
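
The requeue-on-failure pattern above can be exercised locally. The sketch below uses a thread pool from the Python 3 standard library in place of remote workers and catches a generic `Exception` where the slide catches `RemoteException`. The function name `run_jobs` and its parameters are hypothetical and are not part of gluepy.

```python
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED


def run_jobs(jobs, work, max_workers=2):
    """Run work(job) for every job, resubmitting any job whose call raises."""
    pending = list(jobs)
    assigned = {}   # future -> job, like the slide's `assigned` dict
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while pending or assigned:
            # Hand out jobs while there is spare capacity.
            while pending and len(assigned) < max_workers:
                job = pending.pop(0)
                assigned[pool.submit(work, job)] = job
            # Block until at least one call finishes.
            done, _ = wait(list(assigned), return_when=FIRST_COMPLETED)
            for f in done:
                job = assigned.pop(f)
                try:
                    results.append(f.result())
                except Exception:
                    pending.append(job)   # failure handling: requeue the job
    return results
```

A job is lost only if `work` fails forever, in which case this loop never terminates. A retry limit would be a natural addition, but it is outside what the slides show.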



Master-worker in gluepy (3/3)

    • RMI

# Master init
master = Master()
master.register("master")
master.run()

# Worker init
worker = Worker()
master = RemoteRef("master")     # lookup on join
master.nodeJoin(worker)
while True:
    sleep(1)



Proc: A

Proc: B

  • ABCL [Yonezawa 90]

    JavaRMI, Manta [Maassen et al. 99]

    ProActive [Huet et al. 04]

    • RMI (Remote Method Invocation)

    • RMI

[Figure: Proc A calls a.f() on object a, and f() runs in Proc B via RMI; with async. RMI, Proc A issues a.f() to several Proc B instances and the f() calls run concurrently]



Grid

Proc: A

Proc: B

Proc: A

Proc: A

    • 1RMI

    • Active Objects

      • 1 object = 1 thread

      • e.g.:

      • Event drivenflow

[Figure: several a.f() calls run f() on the same object a concurrently, causing a race; with active objects a and b, b.f() calling a.g() results in deadlock]



The Basic Programming Model

Proc: A

Proc: B

    • RMI

    • Passive Objects

    • RMI

  • Future

    • RMI

    • placeholder

[Figure: a thread is spawned to run f() for the RMI a.f(); for the asynchronous call F = a.f() async, a thread is spawned to run f() and the result is stored in F]



CDF



All-to-All:



100 [Mbps]



Application-level Overlay

    • stateful firewall

TCP/UDP

Link

firewall Cluster

Cluster


[Figure: src to dst through three intermediate buffers]



    • best effort

    • best effort

[Figure: src to dst through two buffers over 1Gbps links; src to dst over 1Gbps links and a WAN link]



(1/2)

  • TCP

    • best effort

    • TCP

FULL!

FULL!

src

dst



(2/2)

    • AB

    • B

      A

    • AB

      • AB

[Figure: Link A and Link B each have a FULL buffer; one waits on the other]


Dijkstra-like (1/3)

    • R ranks

      • Updown: R = 2

    • R

      • (node-id, rank-id)

    • 2

      • rank

      • Ordered-link

      • Updown

        • Up: 0

        • Down: 1

[Figure: node nid expanded into states (nid, 0) through (nid, R-1), with edges labeled by rank r, 0, and 1]


Dijkstra-like (2/3)

    • (nid, r)r

    • (n0, r)rn1

      (n1, r)

    • src

      • (src, 0), (src, R-1)0

r'

(nid, r)

(nid, r)

(nid, 0)


Dijkstra-like (3/3)

  • Dijkstra: O(|E| + |V| log |V|)

    • decrease-key: O(1)

    • : R

    • |E| = eR

    • |V| = nR

  • O(eR + (nR) log(nR))

    • ():R = n
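
The slides describe the search over states (node-id, rank-id), with R = 2 for Updown (Up: 0, Down: 1). A minimal Python sketch of that idea for the Updown case follows: Dijkstra's algorithm runs over (node, rank) states starting from (src, 0), and a hop whose rank is lower than the current rank (DOWN followed by UP) is skipped. The hop direction convention (toward a smaller ID is UP), the graph layout, and the function name `shortest_updown` are assumptions made for illustration. The sketch uses a binary heap from `heapq`, so it runs in O((eR + nR) log(nR)) rather than the bound above, which relies on O(1) decrease-key.

```python
import heapq


def shortest_updown(graph, src, dst):
    """Cheapest legal Up/Down path. graph: {node: {neighbor: cost}}.

    Returns (cost, path) or None if dst is unreachable under the restriction.
    """
    start = (src, 0)            # rank 0: still allowed to go UP
    dist = {start: 0}
    prev = {}
    heap = [(0, start)]
    while heap:
        d, state = heapq.heappop(heap)
        if d > dist.get(state, float("inf")):
            continue            # stale heap entry
        node, rank = state
        if node == dst:
            path = [node]
            while state in prev:
                state = prev[state]
                path.append(state[0])
            return d, path[::-1]
        for nxt, cost in graph[node].items():
            hop_rank = 0 if node > nxt else 1   # assumed: smaller ID is UP
            if hop_rank < rank:
                continue        # would be DOWN followed by UP
            nstate = (nxt, hop_rank)
            nd = d + cost
            if nd < dist.get(nstate, float("inf")):
                dist[nstate] = nd
                prev[nstate] = state
                heapq.heappush(heap, (nd, nstate))
    return None


# Example: the cheap route 1 -> 3 -> 2 goes DOWN then UP, so it is illegal.
example = {
    0: {1: 5, 2: 5},
    1: {0: 5, 3: 1},
    2: {0: 5, 3: 1},
    3: {1: 1, 2: 1},
}
```

In `example`, the unrestricted shortest path from 1 to 2 costs 2 but is prohibited, so the search falls back to the route through node 0.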

