
gluepy: A Simple Distributed Python Programming Framework for Complex Grid Environments

gluepy: A Simple Distributed Python Programming Framework for Complex Grid Environments. 8/1/08. Ken Hironaka, Hideo Saito, Kei Takahashi, Kenjiro Taura, The University of Tokyo. Barriers of Grid Environments: Grid = Multiple Clusters (LAN/WAN); complex environment; dynamic node joins.


Presentation Transcript


  1. gluepy: A Simple Distributed Python Programming Framework for Complex Grid Environments. 8/1/08. Ken Hironaka, Hideo Saito, Kei Takahashi, Kenjiro Taura. The University of Tokyo. www.logos.ic.i.u-tokyo.ac.jp/~kenny/gluepy

  2. Barriers of Grid Environments • Grid = Multiple Clusters (LAN/WAN) • Complex environment • Dynamic node joins • Resource removal/failure (network and nodes) • Connectivity: NAT/firewall ⇒ Grid-enabled frameworks are crucial to facilitate computing in these environments

  3. What Type of Applications? • Typical usage • Standalone jobs • No interaction among nodes • Parallel and distributed applications • Orchestrate nodes for a single application • Map an existing application onto the Grid • Requires complex interaction ⇒ frameworks must make it simple and manageable

  4. Common Approaches (1) • Programming-less • Batch scheduler • Task placement (inter-cluster) • Transparent retries on failure • Enables only minimal interaction • Pass data via files/raw sockets • Embarrassingly parallel tasks • Very limited for applications

  5. Common Approaches (2) • Incorporate some user programming • e.g., Master-Worker framework • Program the master/worker(s) • Job distribution • Handling worker join/leave • Error handling • Enables simple interaction • Still limited in application ⇒ For more complex interaction (a larger problem set), frameworks must allow more flexible/general programming

  6. The Most Flexible Approach • Parallel programming languages • Extend existing languages: retains flexibility • Countless past examples (MultiLisp [Halstead '85], Java RMI, ProActive [Huet et al. '04], …) • Problem: not in the context of the Grid • Node joins/leaves? • Resolving connectivity with NAT/firewall? • Coding becomes complex/overwhelming ⇒ Can we complement this approach with Grid support?

  7. Our Contribution • A Grid-enabled distributed object-oriented framework • A focus on coping with complex environments • Joins, failures, connectivity • Simple programming & minimal configuration • A simple tool to act as a glue for the Grid • Implemented parallel applications on a Grid environment with 900 cores (9 clusters)

  8. Agenda • Introduction • Related Work • Proposal • Evaluation • Conclusion

  9. Programming-less Frameworks • Condor/DAGMan [Thain et al. '05] • Batch scheduler • Transparent retries / handles multiple clusters • Extremely limited interaction among nodes • Tasks with DAG dependencies • Pass on data using intermediate/scratch files

  10. "Restricted" Programming Frameworks • Master-Worker model: Jojo2 [Aoki et al. '06], OmniRPC [Sato et al. '01], Ninf-C [Nakata et al. '04], NetSolve [Casanova et al. '96] • Event-driven master code: handles join/leave • Map-Reduce [Dean et al. '05] • Define 2 functions: map(), reduce() • Partial retries when nodes fail • Ibis/Satin [Wrzesinska et al. '06] • Distributed divide-and-conquer • Random work stealing: accommodates join/leave • Effective for specialized problem sets • Specializing on a problem/model makes mapping/programming easy • For "unexpected" models, users have to resort to out-of-band/ad-hoc means

  11. Distributed Object-Oriented Frameworks • ABCL [Yonezawa '90], Java RMI, Manta [Maassen et al. '99], ProActive [Huet et al. '04] • Distributed object-oriented • Disperse objects among resources • Load delegation/distribution • Method invocations • RMI (Remote Method Invocation) • Async. RMIs for parallelism • RMI is a good abstraction • An extension of a general language allows flexible coding

  12. Hurdles for DOO on the Grid • Race conditions • Simultaneous RMIs on 1 object • Active objects: 1 object = 1 thread • Deadlocks, e.g., on recursive calls • Handling asynchronous events • e.g., handling node joins • Why not event-driven? • The flow of the program is segmented and hard to follow • Handling joins/failures • Difficult to handle transparently in a reasonable manner (checkpoint? automatic retry?)

  13. Hurdles for Implementation • Connectivity with NAT/firewall • Solution: build an overlay • Existing implementations • ProActive [Huet et al. '04] • Tree-topology overlay • User must hand-write connectable points • Jojo2 [Aoki et al. '06] • 2-level hierarchical topology • SSH / UDP broadcast • Assumes a network topology/setting out of the user's control • Requirement: minimal user burden

  14. Summary of the Problems • Distributed object-oriented programming on the Grid • Thread race conditions • Event handling • Node join/leave • Underlying connectivity

  15. Proposal: gluepy • A Grid-enabled distributed object-oriented framework • As a Python library • Glues together Grid resources via simple and flexible coding • Resolves the issues in an object-oriented paradigm • SerialObjects • Define "ownership" for objects • Blocking operations unblock on events • Constructs for handling node join/leave • Resolve the "first reference" problem • Failures are abstracted as exceptions • Connectivity (NAT/firewall) • Peers automatically construct an overlay

  16. The Basic Programming Model • RemoteObjects • Created in/mapped to a process • Accessible from other processes (RMI) • Passive objects • Threads are not bound to objects • Threads • Simply to gain parallelism • RMIs / async. invocations implicitly spawn a thread • Futures • Returned for async. invocations • Placeholder for the result • An uncaught exception is stored and re-raised at collection
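The future semantics described above (a placeholder for the result, with uncaught exceptions stored and re-raised at collection) can be sketched with standard-library threads. This is an illustrative analogue only, not gluepy's implementation; the names Future and async_call are assumptions.

```python
import threading

class Future:
    """Placeholder for the result of an asynchronous call (sketch)."""
    def __init__(self):
        self._done = threading.Event()
        self._result = None
        self._exc = None

    def get(self):
        # Block until the call completes; re-raise any stored exception.
        self._done.wait()
        if self._exc is not None:
            raise self._exc
        return self._result

def async_call(func, *args):
    """Run func in a spawned thread, returning a Future (sketch)."""
    f = Future()
    def runner():
        try:
            f._result = func(*args)
        except Exception as e:
            f._exc = e
        finally:
            f._done.set()
    threading.Thread(target=runner).start()
    return f
```

Usage mirrors the slide: `f = async_call(work, arg)` spawns a thread, and `f.get()` later blocks for the result or re-raises the worker's exception.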

  17. Programming in gluepy • Basics: RemoteObject • Inherit the base class to become externally referenceable • Async. invocation with futures • No explicit threads • Easier to maintain a sequential flow • Mutual exclusion? Events? ⇒ SerialObjects

    class Peer(RemoteObject):
        def run(self, arg):
            # work here...
            return result

    futures = []
    for p in peers:
        f = p.run.future(arg)   # async. RMI run() on all peers
        futures.append(f)
    waitall(futures)            # wait for all results
    for f in futures:
        print f.get()           # read all results

  18. "Ownership" with SerialObjects • SerialObjects • Objects with mutual exclusion • A RemoteObject sub-class • No explicit locks • Ownership for each object • call ⇒ acquire, return ⇒ release • Method execution by only 1 thread: the "owner thread" • The owner releases ownership on blocking operations • e.g., waitall(), RMI to another SerialObject • Pending threads contest for ownership • An arbitrary thread is scheduled • Eliminates deadlocks for recursive calls

  19. Signals to SerialObjects • We don't want event-driven loops! • Events → "signals" • Blocking ops unblock on a signal • Signals are sent to objects • Unblock a thread blocking in the object's context • If none, unblock the next thread that blocks • The unblocked thread can handle the signal (event)

  20. SerialObjects in gluepy • e.g., a queue • pop() blocks on an empty queue • add() calls signal() to unblock a waiter • Atomic section: between blocking ops in a method • Can update object attributes and do invocations on non-SerialObjects

    class DistQueue(SerialObject):
        def __init__(self):
            self.queue = []

        def add(self, x):
            self.queue.append(x)
            if len(self.queue) == 1:
                self.signal()          # signal & wake a waiter

        def pop(self):
            while len(self.queue) == 0:
                wait([])               # block until signaled
            x = self.queue.pop(0)
            return x
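The SerialObject semantics used by DistQueue map closely onto a condition variable: the lock plays the role of ownership, and wait() releases the lock exactly as a blocking operation releases ownership. A standard-library analogue (for illustration only; LocalQueue is an assumed name, not gluepy code):

```python
import threading

class LocalQueue:
    """Plain-Python analogue of the DistQueue semantics (sketch)."""
    def __init__(self):
        self._cond = threading.Condition()  # lock ~ ownership
        self._queue = []

    def add(self, x):
        with self._cond:            # acquire "ownership"
            self._queue.append(x)
            self._cond.notify()     # ~ signal(): wake one blocked waiter

    def pop(self):
        with self._cond:
            while not self._queue:
                self._cond.wait()   # release "ownership" while blocked
            return self._queue.pop(0)
```

The `while not self._queue` loop matches the slide's `while len(self.queue) == 0`: a woken thread re-contests for ownership and must re-check its condition.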

  21. Managing Dynamic Resources • Node join: a Python process starts • Node leave: process termination • Constructs for node joins/leaves • Node join ⇒ the "first reference" problem • Object lookup: obtain a reference to an existing object in the computation • Node leave ⇒ RMI exception • Catch it to handle the failure

  22. e.g., Master-Worker in gluepy (1/3) • Handles join/leave • Code for join: a join invokes signal(), which unblocks the main master thread

    class Master(SerialObject):
        ...
        def nodeJoin(self, node):
            self.nodes.append(node)
            self.signal()                  # signal for join

        def run(self):
            assigned = {}
            while True:
                while len(self.nodes) > 0 and len(self.jobs) > 0:
                    # ... async. RMIs to idle workers ...
                readys = wait(futures)     # block & handle join
                if readys == None:
                    continue
                for f in readys:
                    # ... handle results ...

  23. e.g., Master-Worker in gluepy (2/3) • Failure handling • An exception is raised on collection • Handle the exception to resubmit the task

    for f in readys:
        node, job = assigned.pop(f)
        try:
            print "done:", f.get()
            self.nodes.append(node)
        except RemoteException, e:         # failure handling
            self.jobs.append(job)

  24. e.g., Master-Worker in gluepy (3/3) • Deployment • The master exports an object • Workers get a reference and do an RMI to join

    Master:
        master = Master()
        master.register("master")
        master.run()

    Worker:
        worker = Worker()
        master = RemoteRef("master")       # lookup on join
        master.nodeJoin(worker)
        while True:
            sleep(1)

  25. Automatic Overlay Construction (1) • Solution for connectivity: automatically construct an overlay • TCP overlay • On boot, acquire information on other peers • Each node connects to a small number of peers • Establishes a connected graph

  26. Automatic Overlay Construction (2) • Firewalled clusters • Automatic port-forwarding • The user configures SSH info, e.g., in a config file: use src_pat dst_pat, prot=ssh, user=kenny • Transparent routing • Peer-to-peer communication is routed (AODV [Perkins '97])

  27. RMI Failure Detection on the Overlay • Problem with an overlay • A route consists of a number of connections • RMI failure ⇒ failure of any intermediate connection • Path pointers • Recorded on each forwarding node • The RMI reply returns along the path it came • On failure of an intermediate connection, the preceding forwarding node back-propagates the failure to the RMI invoker
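The path-pointer scheme above might be sketched as follows; the route representation, link-failure model, and exception type are all assumptions for illustration, not gluepy internals.

```python
class RouteFailure(Exception):
    """Back-propagated toward the RMI invoker when a link dies (sketch)."""

def forward_rmi(route, dead_links):
    """Forward an RMI hop by hop along `route` (a list of node names).

    Each forwarding node records a path pointer back to its predecessor,
    so the reply can return along the path it came. If the next link is
    dead, the preceding node detects it and the failure is back-propagated
    (here: raised) to the invoker.
    """
    path_pointers = {}
    for i in range(len(route) - 1):
        link = (route[i], route[i + 1])
        if link in dead_links:
            # the node before the dead link back-propagates the failure
            raise RouteFailure("link %s -> %s failed" % link)
        path_pointers[route[i + 1]] = route[i]  # reply path pointer
    return path_pointers
```

With an intact route `["A", "B", "C"]` the pointers record `C -> B -> A`; killing link `("B", "C")` surfaces as a RouteFailure at the invoker, matching the exception-on-RMI-failure model of slide 21.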

  28. Agenda • Introduction • Related Work • Proposal • Evaluation • Conclusion

  29. Experimental Environment • InTrigger: a Grid platform in Japan • Max. scale: 9 clusters, over 900 cores • Cluster sizes (cores): istbs: 316, tsubame: 64, mirai: 48, okubo: 28, hongo: 98, hiro: 88, chiba: 186, kyoto: 70, suzuk: 72, imade: 60, kototoi: 88 • Mix of global and private IPs; some sites drop all inbound packets (firewall) • tsubame and istbs require SSH forwarding

  30. Necessary Configuration • Configuration necessary for the overlay • 2 clusters (tsubame, istbs) require SSH port-forwarding to other clusters ⇒ 2 lines of configuration • Connection instructions are added by regular expression:

    # istbs cluster uses SSH for inter-cluster connections
    use 133\.11\.23\. (?!133\.11\.23\.), prot=ssh, user=kenny
    # tsubame cluster gateway uses SSH for inter-cluster connections
    use 131.112.3.1 (?!172\.17\.), prot=ssh, user=kenny

  31. Overlay Construction Simulation • Evaluated the overlay construction scheme • For different cluster configurations, varied the number of attempted connections per peer • 1,000 trials for each cluster / attempted-connection configuration • 28 global / 238 private peers case: 95%
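The simulation described above can be approximated with a small Monte-Carlo sketch. The connection model here is an assumption for illustration: every peer attempts a fixed number of outbound connections to random peers, and a connection succeeds only if the target has a global IP (private peers cannot accept inbound connections); the slide's actual scheme may differ.

```python
import random

def overlay_connected(n_global, n_private, attempts, rng):
    """One trial: does random peer wiring yield a connected overlay?

    Peers 0..n_global-1 have global IPs and can accept connections;
    the remaining n_private peers can only originate them.
    """
    n = n_global + n_private
    adj = {i: set() for i in range(n)}
    for u in range(n):
        for _ in range(attempts):
            v = rng.randrange(n)
            if v != u and v < n_global:   # target must be connectable
                adj[u].add(v)
                adj[v].add(u)
    # graph search to test connectivity
    seen, stack = {0}, [0]
    while stack:
        for w in adj[stack.pop()]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen) == n

def success_rate(trials=1000, **kw):
    """Fraction of trials producing a connected overlay."""
    rng = random.Random(0)                # fixed seed for reproducibility
    hits = sum(overlay_connected(rng=rng, **kw) for _ in range(trials))
    return hits / float(trials)
```

Sweeping `attempts` for a given global/private mix reproduces the kind of per-configuration success-rate curve the slide reports.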

  32. Dynamic Master-Worker • A Master object distributes work to Worker objects • 10,000 tasks as RMIs • Workers repeatedly join/leave • Tasks for failed nodes are redistributed • No tasks were lost during the experiment

  33. A Real-life Application • A combinatorial optimization problem • Permutation Flow Shop Problem • Parallel branch-and-bound • Master-Worker-like • Requires periodic exchange of bounds • Code • 250 lines of Python as glue code • Each worker node starts up sequential C++ code • Communicates with the local Python process through pipes

  34. Master-Worker Interaction • The master does RMIs to workers: doJob() • Workers make periodic RMIs to the master: exchange_bound() • Not your typical master-worker • Requires a flexible framework like ours

  35. Performance • Work rate = Σi ci / (N · T) • ci: total computation time on core i • N: number of cores • T: completion time • Slight drop with 950 cores, due to the master node becoming overloaded
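The slide defines the metric by its terms: the fraction of total available core-time spent computing, i.e. the sum of the per-core computation times divided by N times the completion time. A small helper with made-up numbers, for illustration only:

```python
def work_rate(comp_times, completion_time):
    """Fraction of total core-time spent computing: sum(ci) / (N * T)."""
    n = len(comp_times)                       # N: number of cores
    return sum(comp_times) / (n * completion_time)

# hypothetical numbers, not from the experiment:
# 4 cores computed for 90, 80, 85, and 95 s out of a 100 s run
rate = work_rate([90.0, 80.0, 85.0, 95.0], 100.0)  # -> 0.875
```

A rate of 1.0 would mean every core computed for the entire run; drops below 1.0 capture idle time and overhead such as the overloaded master noted above.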

  36. Troubleshoot Search Engine • Ever stuck debugging or troubleshooting? • Re-rank query results obtained from Google • Use results from machine-learning web forums • Perform natural language processing on page contents at query time • Example query: "vmware kernel panic" • Use a Grid backend • Computationally intensive • Requires good response time, within 10s of seconds

  37. Troubleshoot Search Engine Overview • A Python CGI calls doSearch(); the backend performs graph extraction, parsing, and rescoring via async. doQuery() / doWork() RMIs • Leveraged sync/async RMIs to seamlessly integrate parallelism into a sequential program • Merged CGIs with the Grid backend

  38. Agenda • Introduction • Related Work • Proposal • Evaluation • Conclusion

  39. Conclusion • gluepy: a Grid-enabled distributed object-oriented framework • Supports simple and flexible coding for complex Grids • SerialObjects • Signal semantics • Object lookup / exceptions on RMI failure • Automatic overlay construction • A tool to glue together Grid resources simply and flexibly • Implemented and evaluated applications on the Grid • Max. scale: 900 cores (9 clusters) • NAT/firewall, with runtime joins/leaves • Parallelized real-life applications • Took full advantage of gluepy constructs for seamless programming

  40. Questions? • gluepy is available from its homepage: www.logos.ic.i.u-tokyo.ac.jp/~kenny/gluepy



  43. Troubleshoot Search Engine Overview • Async. RMI for the query from the CGI script • Graph extraction, parsing, rescoring; results returned asynchronously to the CGI • Async. RMIs to workers • Leverages async. RMI from the CGI script for work distribution on the Grid • All coding done seamlessly in Python, using gluepy

  44. Utilization of the Grid • Grid = Multiple Clusters (LAN/WAN) • Typical usage • Many stand-alone jobs in parallel • Little or no interaction among nodes • Parallel and distributed computing • Utilize nodes for a single application • Parallelize an existing application • Requires complex interaction ⇒ utilize Grid-enabled frameworks

  45. The Demands on the Grid • A framework that realizes flexible/complex interaction on the Grid • Can we learn anything from parallel languages?

  46. Speedup • Scaling stops at around 900 cores

  47. Cumulative Execution Time • The growth in cumulative computation time shows waste due to re-execution • Taking cumulative computation time into account, the speedup from 169 cores to 948 cores (a 5.64× increase in cores) is 4.94×

  48. Micro-benchmarks on the Overlay • RMIs issued from 1 node • Nearly all nodes are reached within 3 hops • Latency • A no-op RMI over the overlay: ping() • Bandwidth • An RMI with a large argument over the overlay: send_data()

  49. Latency on the Overlay • RMIs from 1 node to objects on nodes in 5 clusters: ping() • Compared with the RTT measured by ping • 1 overlay hop ≈ 1.5 ms

  50. Bandwidth on the Overlay • Large argument (de)serialization overhead • Ideal maximum on full 1 Gbit Ethernet: 40 MB/sec • Compared with the maximum computed from iperf measurements • Bandwidth decreases with each overlay hop (store-and-forward) • Bandwidth varies even within a cluster, depending on the overlay hop count
