
Decentralizing Grids


Presentation Transcript


  1. Decentralizing Grids Jon Weissman University of Minnesota E-Science Institute Nov. 8 2007

  2. Roadmap • Background • The problem space • Some early solutions • Research frontier/opportunities • Wrapup

  3. Background • Grids are distributed … but also centralized • Condor, Globus, BOINC, Grid Services, VOs • Why? client-server based • Centralization pros • Security, policy, global resource management • Decentralization pros • Reliability, dynamic, flexible, scalable • Fertile CS research frontier

  4. Challenges • May have to live within the Grid ecosystem • Condor, Globus, Grid services, VOs, etc. • First-principles approaches are risky (Legion) • 50K-foot view • How to decentralize Grids yet retain their existing features? • High performance, workflows, performance prediction, etc.

  5. Decentralized Grid platform • Minimal assumptions about each “node” • Nodes have associated “assets” (A) • basic: CPU, memory, disk, etc. • complex: application services • exposed interface to assets: OS, Condor, BOINC, Web service • Nodes may be up or down • Node trust is not a given (asked to do X, does Y instead) • Nodes may connect to other nodes or not • Nodes may be aggregates • Grid may be large (> 100K nodes); scalability is key
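
To make the node model above concrete, here is a minimal sketch of a node descriptor with basic and complex assets, an exposed interface, and the trust/liveness attributes mentioned on the slide. All class and field names (GridNode, NodeAssets, etc.) are illustrative assumptions, not part of the actual platform.

```python
from dataclasses import dataclass, field
from typing import List

# Illustrative node descriptor for the decentralized Grid platform
# sketched above; every name here is an assumption for the example.

@dataclass
class NodeAssets:
    cpu_gflops: float                                   # basic asset: CPU capacity
    memory_gb: float                                    # basic asset: memory
    disk_gb: float                                      # basic asset: disk
    services: List[str] = field(default_factory=list)   # complex assets, e.g. application services

@dataclass
class GridNode:
    node_id: str
    assets: NodeAssets
    interface: str                                      # how assets are exposed: "OS", "Condor", "BOINC", "WS"
    reputation: float = 0.5                             # trust is not a given; start neutral
    alive: bool = True                                  # nodes may be up or down
    neighbors: List[str] = field(default_factory=list)  # overlay links (possibly empty, possibly an aggregate)
```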

  6. Grid Overlay [diagram: overlay spanning a Condor network, a BOINC network, a Grid service, and raw OS services]

  7. Grid Overlay - Join [diagram: a new node joins the overlay]

  8. Grid Overlay - Departure [diagram: a node leaves the overlay]

  9. Routing = Discovery • A “discover A” query contains sufficient information to locate a node: RSL, ClassAd, etc. • Exact match or semantic match

  10. Routing = Discovery [diagram: the query reaches a node offering A (“bingo!”)]

  11. Routing = Discovery • The discovered node returns a handle sufficient for the “client” to interact with it: service invocation, job/data transmission, etc.
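
The discovery walk-through in these slides can be sketched as TTL-limited forwarding over an unstructured overlay: the query carries its requirements, each node checks for a local (exact) match, and a matching node's handle is returned to the initiator. This is only one possible realization (a structured, DHT-style overlay would route differently), and every identifier below is a placeholder for the sketch.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Hypothetical sketch of "routing = discovery" over an unstructured overlay:
# the query carries ClassAd/RSL-like attribute requirements, each node checks
# for a local match and otherwise forwards to its neighbors, and the matching
# node's handle is returned to the initiator.

@dataclass
class OverlayNode:
    node_id: str                                     # the handle returned on a match
    advertised: Dict[str, float]                     # e.g. {"cpu_gflops": 2.0, "disk_gb": 80}
    neighbors: List["OverlayNode"] = field(default_factory=list)

def matches(required: Dict[str, float], advertised: Dict[str, float]) -> bool:
    """Exact match: every required attribute is present and sufficient."""
    return all(advertised.get(k, 0.0) >= v for k, v in required.items())

def discover(start: OverlayNode, required: Dict[str, float], ttl: int = 6) -> Optional[str]:
    """TTL-limited flooding; returns a handle the client can use, or None."""
    seen, frontier = set(), [(start, ttl)]
    while frontier:
        node, hops = frontier.pop()
        if node.node_id in seen or hops < 0:
            continue
        seen.add(node.node_id)
        if matches(required, node.advertised):       # "bingo!"
            return node.node_id
        frontier.extend((n, hops - 1) for n in node.neighbors)
    return None                                      # asset A not found within the TTL
```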

  12. Routing = Discovery • Three parties • initiator of discovery events for A • client: invocation, health of A • node offering A • Often the initiator and client will be the same • Other times the client will be determined dynamically • if W is a web service and results are returned to a calling client CW, we want to locate CW near W => discover W, then CW!

  13. Routing = Discovery [diagram: “discover A” query routed around a failed node (X)]

  14. Routing = Discovery

  15. Routing = Discovery [diagram: matching node found (“bingo!”)]

  16. Routing = Discovery

  17. Routing = Discovery [diagram: discovery initiated by an outside client]

  18. Routing = Discovery [diagram: “discover A’s” query for multiple matching assets]

  19. Routing = Discovery

  20. Grid Overlay • This generalizes … • Resource query (query contains job requirements) • Looks like decentralized “matchmaking” • These are the easy cases … • independent simple queries • find a CPU with characteristics x, y, z • find 100 CPUs each with x, y, z • suppose queries are complex or related? • find N CPUs with aggregate power = G Gflops • locate an asset near a prior discovered asset
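
As an illustration of the harder, collective queries mentioned above ("find N CPUs with aggregate power = G Gflops"), the sketch below greedily assembles a set of discovered candidates until their combined compute power meets the target. The greedy policy and the dictionary format of the candidates are assumptions made for this example, not the deck's algorithm.

```python
from typing import Dict, List, Optional

# Illustrative handling of a collective query ("find a set of CPUs with
# aggregate power >= G Gflops") over candidates returned by discovery.
# Greedy, fastest-first selection is just one possible policy.

def select_aggregate(candidates: List[Dict], target_gflops: float) -> Optional[List[str]]:
    """Pick nodes (fastest first) until their combined Gflops meets the target."""
    chosen, total = [], 0.0
    for node in sorted(candidates, key=lambda n: n["gflops"], reverse=True):
        chosen.append(node["id"])
        total += node["gflops"]
        if total >= target_gflops:
            return chosen
    return None   # the overlay could not satisfy the collective query

# Example: find CPUs totalling 10 Gflops among discovered candidates.
nodes = [{"id": "a", "gflops": 4.0}, {"id": "b", "gflops": 3.5},
         {"id": "c", "gflops": 2.0}, {"id": "d", "gflops": 1.5}]
print(select_aggregate(nodes, 10.0))   # -> ['a', 'b', 'c', 'd']
```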

  21. Grid Scenarios • Grid applications are more challenging • Application has a more complex structure – multi-task, parallel/distributed, control/data dependencies • individual job/task needs a resource near a data source • workflow • queries are not independent • Metrics are collective • not simply raw throughput • makespan • response • QoS

  22. Related Work • Maryland/Purdue • matchmaking • Oregon-CCOF • time-zone CAN

  23. Related Work (cont’d) None of these approaches address the Grid scenarios (in a decentralized manner) • Complex multi-task data/control dependencies • Collective metrics

  24. 50K Ft Research Issues • Overlay Architecture • structured, unstructured, hybrid • what is the right architecture? • Decentralized control/data dependencies • how to do it? • Reliability • how to achieve it? • Collective metrics • how to achieve them?

  25. Context: Application Model [diagram legend: component, service request / job / task, answer, data source]

  26. Context: Application Models • Reliability • Collective metrics • Data dependence • Control dependence

  27. Context: Environment • RIDGE project - ridge.cs.umn.edu • reliable infrastructure for donation grid environments • Live deployment on PlanetLab – planet-lab.org • 700 nodes spanning 335 sites and 35 countries • emulators and simulators • Applications • BLAST • Traffic planning • Image comparison

  28. Application Models • Reliability • Collective metrics • Data dependence • Control dependence

  29. Reliability Example [diagram: component graph with nodes B, C, D, E, G]

  30. Reliability Example [diagram: client node CG added; CG is responsible for G’s health]

  31. Reliability Example [diagram: G and loc(CG) propagated]

  32. Reliability Example [diagram: could also discover G, then CG]

  33. Reliability Example [diagram: G fails (X)]

  34. Reliability Example [diagram: CG discovers a replacement G]

  35. Reliability Example [diagram: replacement G in place, monitored by CG]

  36. Client Replication [diagram: component graph with nodes B, C, D, E, G]

  37. Client Replication [diagram: replicated client nodes CG1 and CG2; loc(G), loc(CG1), loc(CG2) propagated]

  38. Client Replication [diagram: one client replica fails (X); client “hand-off” depends on the nature of G and the interaction]

  39. Component Replication [diagram: component graph with nodes B, C, D, E, G]

  40. Component Replication [diagram: G replicated as G1 and G2, monitored by CG]

  41. Replication Research • Nodes are unreliable – crash, hacked, churn, malicious, slow, etc. • How many replicas? • too many – waste of resources • too few – application suffers

  42. System Model • Reputation rating ri – degree of node reliability • Dynamically size the redundancy based on ri • Nodes are not connected and check in to a central server • Note: variable-sized groups [diagram: worker nodes with example ratings 0.9, 0.8, 0.8, 0.7, 0.7, 0.4, 0.3, 0.4, 0.8, 0.8]

  43. Reputation-based Scheduling • Reputation rating • Techniques for estimating reliability based on past interactions • Reputation-based scheduling algorithms • Using reliabilities for allocating work • Relies on a success threshold parameter
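
The slides do not give the exact estimator behind the reputation rating, so the following is a hypothetical example of building ri from past interactions with an exponentially weighted update; the learning rate alpha and the neutral prior are assumptions.

```python
# Hypothetical reputation estimator: the slide only says ratings come from
# past interactions, so this exponentially weighted update is an assumption,
# not the RIDGE formula.

def update_reputation(r_old: float, success: bool, alpha: float = 0.1) -> float:
    """Move the rating toward 1 on a correct/timely result, toward 0 otherwise."""
    outcome = 1.0 if success else 0.0
    return (1 - alpha) * r_old + alpha * outcome

r = 0.5                      # neutral prior for a newly seen node
for outcome in [True, True, False, True]:
    r = update_reputation(r, outcome)
print(round(r, 3))           # rating after four interactions
```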

  44. Algorithm Space • How many replicas? • first-fit, best-fit, random, fixed, … • algorithms compute how many replicas are needed to meet a success threshold • How to reach consensus? • M-first (better for timeliness) • Majority (better for Byzantine threats)
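
A hedged sketch of the "how many replicas" computation: if each rating ri is treated as an independent probability of returning a correct result (an assumption made only for this example), a best-fit-style algorithm can keep adding replicas until the estimated success probability clears the threshold. First-fit or random variants would differ only in how the candidates are ordered; the consensus rule (M-first or majority) is applied afterwards to the returned results.

```python
from typing import List

# Sketch of sizing a replica group against a success threshold. Treating each
# reputation rating as an independent probability of a correct result is an
# assumption for illustration, not the paper's exact model.

def size_group(ratings: List[float], threshold: float) -> List[float]:
    """Best-fit flavour: take the most reliable candidates first until
    P(at least one correct result) = 1 - prod(1 - r_i) >= threshold."""
    group, p_all_fail = [], 1.0
    for r in sorted(ratings, reverse=True):
        group.append(r)
        p_all_fail *= (1.0 - r)
        if 1.0 - p_all_fail >= threshold:
            return group
    return group            # best effort if the threshold is unreachable

print(size_group([0.9, 0.8, 0.8, 0.7, 0.4, 0.3], threshold=0.99))
# -> [0.9, 0.8, 0.8]: two replicas give 0.98, three give 0.996
```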

  45. Experimental Results: correctness [chart: simulation based on Byzantine behavior, using majority voting]

  46. Experimental Results: timeliness [chart: M-first (M=1); best BOINC (BOINC*) and conservative BOINC (BOINC-) vs. RIDGE]

  47. Next steps • Nodes are decentralized, but trust management is not! • Need a peer-based trust exchange framework • Stanford: EigenTrust project – local exchange until the network converges to a global state
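
For context, the basic form of the EigenTrust iteration looks roughly like the centralized sketch below: row-normalized local trust values are repeatedly aggregated until the global trust vector stops changing. The real protocol performs this exchange peer-to-peer and adds pre-trusted peers and damping, which are omitted here; the example matrix is invented for illustration.

```python
import numpy as np

# Minimal, centralized sketch of the basic EigenTrust iteration: normalized
# local trust values C are repeatedly aggregated (t <- C^T t) until the
# global trust vector converges. The distributed protocol and pre-trusted
# peers from the EigenTrust paper are omitted.

def eigentrust(local_trust: np.ndarray, eps: float = 1e-6, max_iter: int = 100) -> np.ndarray:
    C = local_trust / local_trust.sum(axis=1, keepdims=True)   # row-normalize c_ij
    n = C.shape[0]
    t = np.full(n, 1.0 / n)                                    # uniform prior
    for _ in range(max_iter):
        t_next = C.T @ t
        if np.linalg.norm(t_next - t, 1) < eps:
            break
        t = t_next
    return t

# Example: peer 2 is rated poorly by the others and ends up with low global trust.
local = np.array([[0.0, 0.8, 0.2],
                  [0.9, 0.0, 0.1],
                  [0.5, 0.5, 0.0]])
print(eigentrust(local))
```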

  48. Application Models • Reliability • Collective metrics • Data dependence • Control dependence

  49. Collective Metrics (BLAST) • Throughput is not always the best metric • Response, completion time, application-centric metrics • makespan • response

  50. Communication Makespan • Nodes download data from replicated data nodes • Nodes choose “data servers” independently (decentralized) • Minimize the maximum download time over all worker nodes (communication makespan) • data download dominates
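
A toy model of the communication-makespan problem: each worker independently picks the data server it expects to be fastest, and a server that many workers pick slows down for all of them. The equal-sharing cost model and all identifiers are assumptions for illustration; the example shows why purely independent choices can inflate the makespan.

```python
from collections import Counter
from typing import Dict, List

# Toy model of communication makespan: each worker independently picks the
# data server it expects to be fastest, and a server's finish time grows with
# the number of workers that picked it. The cost model (server bandwidth
# shared equally) is assumed only for this sketch.

def choose_servers(workers: List[str], est_time: Dict[str, Dict[str, float]]) -> Dict[str, str]:
    """Decentralized choice: each worker picks its lowest-estimate server."""
    return {w: min(est_time[w], key=est_time[w].get) for w in workers}

def makespan(choice: Dict[str, str], est_time: Dict[str, Dict[str, float]]) -> float:
    """Max download time, with a server's time scaled by how many workers share it."""
    load = Counter(choice.values())
    return max(est_time[w][s] * load[s] for w, s in choice.items())

# Example: two data servers, three workers, per-server download estimates in seconds.
est = {"w1": {"s1": 10, "s2": 14},
       "w2": {"s1": 11, "s2": 13},
       "w3": {"s1": 12, "s2": 12}}
pick = choose_servers(["w1", "w2", "w3"], est)
print(pick, makespan(pick, est))   # independent greedy choices overload s1, inflating the makespan
```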
