Public Computing - Challenges and Solutions

Public Computing - Challenges and Solutions. Yi Pan Professor and Chair of CS Professor of CIS Georgia State University Atlanta, Georgia, USA AINA 2007 May 21, 2007. Outlines. What is Grid Computing? Virtual Organizations Types of Grids Grid Components Applications Grid Issues

Presentation Transcript


  1. Public Computing - Challenges and Solutions Yi Pan Professor and Chair of CS Professor of CIS Georgia State University Atlanta, Georgia, USA AINA 2007 May 21, 2007

  2. Outlines • What is Grid Computing? • Virtual Organizations • Types of Grids • Grid Components • Applications • Grid Issues • Conclusions

  3. Outlines -continued • Public Computing and the BOINC Architecture • Motivation for New Scheduling Strategies • Scheduling Algorithms • Testing Environment and Experiments • MD4 Password Hash Search • Avalanche Photodiode Gain and Impulse Response • Gene Sequence Alignment • Peer to Peer Model and Experiments • Conclusion and Future Research

  4. What is Grid Computing? • Analogy is to power grid • Heterogeneous and geographically dispersed

  5. What is Grid Computing? • Analogy is to power grid • Heterogeneous and geographically dispersed • Standards allow for transportation of power

  6. What is Grid Computing? • Analogy is to power grid • Heterogeneous and geographically dispersed • Standards allow for transportation of power • Standards define interface with grid

  7. What is Grid Computing? • Analogy is to power grid • Heterogeneous and geographically dispersed • Standards allow for transportation of power • Standards define interface with grid • Non-trivial overhead of managing movement and storage of power • Economies of scale compensate for this overhead allowing for cheap, accessible power

  8. A Computational “Power Grid” • Goal is to make computation a utility • Computational power, data services, peripherals (Graphics accelerators, particle colliders) are provided in a heterogeneous, geographically dispersed way

  9. A Computational “Power Grid” • Goal is to make computation a utility • Computational power, data services, peripherals (Graphics accelerators, particle colliders) are provided in a heterogeneous, geographically dispersed way • Standards allow for transportation of these services

  10. A Computational “Power Grid” • Goal is to make computation a utility • Computational power, data services, peripherals (Graphics accelerators, particle colliders) are provided in a heterogeneous, geographically dispersed way • Standards allow for transportation of these services • Standards define interface with grid • Architecture provides for management of resources and controlling access • Large amounts of computing power should be accessible from anywhere in the grid

  11. Virtual Organizations • Independent organizations come together to pool grid resources • Component organizations could be different research institutions, departments within a company, individuals donating computing time, or anything with resources • Formation of the VO should define participation levels, resources provided, expectations of resource use, accountability, economic issues such as charge for resources • Goal is to allow users to exploit resources throughout the VO transparently and efficiently

  12. Types of Grids • Computational Grid • Data Grid • Scavenging Grid • Peer-to-Peer • Public Computing

  13. Computational Grids • Traditionally used to connect high performance computers between organizations • Increases utilization of geographically dispersed computational resources • Provides more parallel computational power to individual applications than is feasible for a single organization • Most traditional grid projects concentrate on these types of grids • Globus and OGSA

  14. Data Grids • Distributed data sources • Queries of distributed data • Sharing of storage and data management resources • D0-Particle Physics Data Grid allows access to both compute and data resources for huge amounts of physics data • Google

  15. Scavenging Grids • Harness idle cycles on systems especially user workstations • Parallel application must be quite granular to take advantage of large amounts of weak computing power • Grid system must support terminating and restarting work when systems cease idling • Condor system from University of Wisconsin
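The requirement that a scavenging grid support "terminating and restarting work when systems cease idling" can be illustrated with a minimal checkpointing sketch. This is not how Condor works internally; the function `run_workunit`, the `checkpoint` dictionary layout, and the `is_idle` callback are all hypothetical names chosen for illustration.

```python
def run_workunit(items, checkpoint, is_idle):
    """Sum `items`, resuming from `checkpoint` and yielding when the host is busy.

    `checkpoint` is a dict with keys "next" (index of the next item) and
    "partial" (result so far); it is updated in place after every item so
    that no finished work is lost when the owner reclaims the machine.
    Returns True once the workunit is complete, False if it was interrupted.
    """
    while checkpoint["next"] < len(items):
        if not is_idle():
            return False  # host is in use again: stop, state is already saved
        checkpoint["partial"] += items[checkpoint["next"]]
        checkpoint["next"] += 1
    return True


# Hypothetical usage: the host becomes busy after two items, then idles again.
items = [1, 2, 3, 4, 5]
state = {"next": 0, "partial": 0}
idle_trace = iter([True, True, False, True, True, True])
first_run = run_workunit(items, state, lambda: next(idle_trace))
second_run = run_workunit(items, state, lambda: next(idle_trace))
```

Checking for idleness between small steps is what makes fine-grained ("quite granular") work a good fit: the less work is done per step, the less is lost on each interruption.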

  16. Peer-to-Peer • Converging technology with traditional grids • Contrasts with grids by having little infrastructure and high fault tolerance • Highly scalable for participation but difficult to locate and monitor resources • Current P2P systems like Gnutella, Freenet, FastTrack concentrate on data services

  17. Public Computing • Also converging with grid computing • Often communicates through a central server in contrast with peer-to-peer technologies • Again scalable with participation • Adds even greater impact of multiple administrative domains as participants are often untrusted and unaccountable

  18. Public Computing Examples • SETI@Home (http://setiathome.ssl.berkeley.edu/) – Search for Extraterrestrial Intelligence in radio telescope data (UC Berkeley), a distributed network computing effort to search for extraterrestrial civilizations • Has more than 5 million participants • “The most powerful computer, IBM's ASCI White, is rated at 12 TeraFLOPS and costs $110 million. SETI@home currently gets about 15 TeraFLOPs and has cost $500K so far.”
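The cost comparison in the quotation can be made concrete with a short calculation. The figures are taken directly from the slide; the variable and function names are illustrative only, and the result says nothing about operating costs, which the slide does not give.

```python
# Figures as quoted on the slide (cost in US dollars, rating in TeraFLOPS).
asci_white_cost_usd = 110_000_000
asci_white_tflops = 12
seti_cost_usd = 500_000
seti_tflops = 15


def cost_per_tflops(cost_usd, tflops):
    """Return the cost in dollars of one TeraFLOPS of rated capacity."""
    return cost_usd / tflops


asci_rate = cost_per_tflops(asci_white_cost_usd, asci_white_tflops)
seti_rate = cost_per_tflops(seti_cost_usd, seti_tflops)

# How many times cheaper SETI@home is per TeraFLOPS, by these figures.
ratio = asci_rate / seti_rate

print(f"ASCI White: ${asci_rate:,.0f} per TFLOPS")
print(f"SETI@home:  ${seti_rate:,.0f} per TFLOPS")
print(f"Ratio:      {ratio:.0f}x")
```

By these numbers ASCI White costs roughly $9.2 million per TeraFLOPS against roughly $33 thousand for SETI@home, a factor of about 275.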

  19. More Public Computing Examples • Folding@Home project (http://folding.stanford.edu) for molecular simulation aimed at new drug discovery • Distributed.net (http://distributed.net) for cracking RC5 64-bit encryption algorithm – used more than 300,000 nodes over 1757 days

  20. Grid Components • Authentication and Authorization • Resource Information Service • Monitoring • Scheduler • Fault Tolerance • Communication Infrastructure

  21. Authentication and Authorization • Important for allowing users to cross the administrative boundaries in a virtual organization • System security for jobs outside the administrative domain currently rudimentary • Work being done on sandboxing, better job control, development environments

  22. Resource Information Service • Used in resource discovery • Leverages existing technologies such as LDAP, UDDI • Information service must be able to report very current availability and load data • Balanced with overhead of updating data

  23. Monitoring • Raw performance characteristics are not the only measurement of resource performance • Current and expected loads can have a tremendous impact • Balance between accurate performance data and additional overhead of monitoring systems and tracking that data

  24. Scheduler • Owners of systems interested in maximizing throughput • Users interested in maximizing runtime performance • Both offer challenges with crossing administrative boundaries • Unique issues such as co-allocation and co-location • Interesting work being done in scheduling like market based scheduling

  25. Fault Tolerance • More work exploring fault tolerance in grid systems leveraging peer-to-peer and public computing research • Multiple administrative domains in VO challenge the reliability of resources • Faults can refer not only to resource failure but violation of service level agreements (SLA) • Impact on fault tolerance if there is no accountability for failure

  29. Communication Infrastructure • Currently most grids have robust communication infrastructure • As more grids are deployed and used, more concentration must be done on network QoS and reservation • Most large applications are currently data rich • P2P and Public Computing have experience in communication poor environments

  30. Applications • Embarrassingly parallel, data poor applications in the case of pooling large amounts of weak computing power • Huge data-intensive, data rich applications that can take advantage of multiple, parallel supercomputers • Application specific grids like Cactus and Nimrod

  31. Grid Issues • Site autonomy • Heterogeneous resources • Co-allocation • Metrics for resource allocation • Language for utilizing grids • Reliability

  32. Site autonomy • Each component of the grid could be administered by an individual organization participating in the VO • Each administrative domain has its own policies and procedures surrounding their resources • Most scheduling and resource management work must be distributed to support this

  33. Heterogeneous resources • Grid resources will have not only heterogeneous platforms but heterogeneous workloads • Applications truly exploiting grid resources will need to scale from idle cycles on workstations, huge vector based HPCs, to clusters • Not only computation power, also storage, peripherals, reservability, availability, network connectivity

  34. Co-allocation • Unique challenges of reserving multiple resources across administrative domains • Capabilities of resource management may be different for each component of a composite resource • Failure of allocating components must be handled in a transaction-like manner • Acceptable substitute components may assist in co-allocating a composite resource
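The "transaction-like" handling of allocation failures, together with acceptable substitutes, can be sketched as an all-or-nothing reservation. This is a minimal illustration under simplifying assumptions (a single process, no concurrency, no real resource managers); the `Resource` class and `co_allocate` function are hypothetical and do not correspond to any particular grid toolkit's API.

```python
class Resource:
    """Hypothetical resource with a fixed number of free slots."""

    def __init__(self, name, free):
        self.name = name
        self.free = free

    def reserve(self, amount):
        """Reserve `amount` slots; return False and change nothing if unavailable."""
        if amount > self.free:
            return False
        self.free -= amount
        return True

    def release(self, amount):
        """Give back `amount` previously reserved slots."""
        self.free += amount


def co_allocate(requests):
    """Reserve every component of a composite resource, or none of them.

    `requests` is a list of (candidates, amount) pairs. `candidates` lists the
    acceptable resources for that component in order of preference, so later
    entries act as substitutes. Returns the chosen resources, or None after
    rolling back every reservation made so far.
    """
    held = []
    for candidates, amount in requests:
        chosen = next((r for r in candidates if r.reserve(amount)), None)
        if chosen is None:
            for resource, reserved in held:  # roll back partial allocation
                resource.release(reserved)
            return None
        held.append((chosen, amount))
    return [resource for resource, _ in held]


# Hypothetical usage: the preferred cluster is too small, so a substitute is used.
cluster = Resource("cluster", 4)
backup = Resource("backup", 8)
storage = Resource("storage", 1)
allocation = co_allocate([([cluster, backup], 6), ([storage], 1)])
```

In a real virtual organization each `reserve` call would cross an administrative boundary, which is why a failed component has to trigger explicit release of the others rather than relying on a shared lock.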

  35. Metrics for resource allocation • Different scheduling approaches measure performance differently • Historical performance • Throughput • Storage • Network connectivity • Cost • Application specific performance • Service level

  36. Language for utilizing grids • Much of the work in grids is protocol or language work • Expressive languages needed for negotiating service level, reporting performance or resource capabilities, security, and reserving resources • Protocol work in authentication and authorization, data transfer, and job management

  37. Summary about Grids • Grids offer tremendous computation and data storage resources not available in single systems or single clusters • Application and algorithm design and deployment still either rudimentary or application specific • Universal infrastructure still in development • Unique challenges still unsolved especially in regard to fault tolerance and multiple administrative domains

  38. Public Computing • Aggregates idle workstations connected to the Internet for performing large scale computations • Initially seen in volunteer projects such as Distributed.net and SETI@home • Volunteer computers periodically download work from a project server and complete the work during idle periods • Currently used in projects that have large workloads on the scale of months or years with trivially parallelizable tasks

  39. BOINC Architecture • Berkeley Open Infrastructure for Network Computing • Developed as a generic public computing framework • Next generation architecture for the SETI@home project • Open source and encourages use in other public computing projects

  40. BOINC lets you donate computing power to the following projects • Climateprediction.net: study climate change • Einstein@home: search for gravitational signals emitted by pulsars • LHC@home: improve the design of the CERN LHC particle accelerator • Predictor@home: investigate protein-related diseases • SETI@home: Look for radio evidence of extraterrestrial life • Cell Computing biomedical research (Japanese; requires nonstandard client software)

  41. BOINC Architecture

  42. Motivation for New Scheduling Strategies • Many projects require large scale computational resources but are not of the current public computing scale • Grid and cluster scale projects are very popular in many scientific computing areas • Current public computing scheduling does not scale down to these smaller projects

  43. Motivation for New Scheduling Strategies • Grid scale scheduling for public computing would make public computers a viable alternative or complementary resource to grid systems • Public computing has the potential to offer a tremendous amount of computing resources from idle systems of organizations or volunteers • Scavenging grid projects such as Condor indicate interest in harnessing these resources in the grid research community

  44. Scheduling Algorithms • Current BOINC scheduling algorithm • New scheduling algorithms • First Come, First Serve with target workload of 1 workunit (FCFS-1) • First Come, First Serve with target workload of 5 workunits (FCFS-5) • Ant Colony Scheduling Algorithm

  45. BOINC Scheduling • Originally designed for “unlimited” work • Clients can request as much work as desired up to a specified limit • Smaller, limited computational jobs faced with the challenge of more accurate scheduling • Too many workunits assigned to a node leads to either redundant computation by other nodes or exhaustion of available workunits • Too few workunits assigned leads to increased communication overhead

  46. New Scheduling Strategies • New strategies target computational problems on the scale of many hours or days • Four primary goals: • Reduce application execution time • Increase resource utilization • No reliance on client supplied information • Remain application neutral

  47. First Come First Serve Algorithms • Naïve scheduling algorithms based solely on the frequency of client requests for work • Server-centric approach which does not depend on client supplied information for scheduling • At each request for work, the server compares the number of workunits already assigned to the node with a target work level and sends work accordingly • Two algorithms tested, targeting a workload of either one workunit (FCFS-1) or five workunits (FCFS-5)
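The server-side rule described on this slide can be sketched as follows. This is not BOINC's scheduler code: the class `FcfsScheduler` and its method names are hypothetical, and the sketch assumes that "sends work based on a target worklevel" means topping a node up to the target, which the slide does not state explicitly.

```python
from collections import deque


class FcfsScheduler:
    """Serve work requests in arrival order, capped by a per-node target."""

    def __init__(self, workunits, target):
        self.queue = deque(workunits)  # unassigned workunits, oldest first
        self.target = target           # 1 for FCFS-1, 5 for FCFS-5
        self.assigned = {}             # node id -> set of outstanding workunits

    def request_work(self, node):
        """Top `node` up to the target using only server-side bookkeeping."""
        outstanding = self.assigned.setdefault(node, set())
        batch = []
        while self.queue and len(outstanding) < self.target:
            workunit = self.queue.popleft()
            outstanding.add(workunit)
            batch.append(workunit)
        return batch

    def report_result(self, node, workunit):
        """Record a completed workunit so the node can receive more work."""
        self.assigned[node].discard(workunit)


# Hypothetical usage: FCFS-5 over eight workunits and two nodes.
scheduler = FcfsScheduler(range(8), target=5)
first = scheduler.request_work("node-a")   # node-a is filled to the target
second = scheduler.request_work("node-a")  # already at target, nothing sent
third = scheduler.request_work("node-b")   # node-b gets what is left
```

Because the decision uses only the server's own record of outstanding workunits, it needs no client-supplied information, which matches one of the stated goals of the new strategies.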

  48. Ant Colony Algorithms • Meta-heuristic modeling the behavior of ants searching for food • Ants make decisions based on pheromone levels • Decisions affect pheromone levels to influence future decisions

  49. Ant Colony Algorithms • Initial decisions are made at random • Ants leave trail of pheromones along their path • Next ants use pheromone levels to decide • Still random since initial trails were random

  50. Ant Colony Algorithms • Shorter paths will complete quicker leading to feedback from the pheromone trail • Ant at destination now bases return decision on pheromone level • Decisions begin to become ordered
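The pheromone feedback loop described on these slides can be illustrated with a small model of two or more paths between nest and food. This is a deliberately simplified, deterministic (expected-value) version rather than a stochastic ant simulation, and it is not the scheduling algorithm evaluated in the talk; the function names, the evaporation rate, and the 1/length deposit rule are assumptions made for illustration.

```python
def update_pheromone(pheromone, lengths, evaporation=0.1):
    """Apply one expected-value round of pheromone evaporation and deposit.

    Ants are assumed to split across paths in proportion to the current
    pheromone levels. Each path then gains a deposit equal to its share of
    ants divided by its length, modeling that shorter trips complete sooner
    and therefore reinforce their trail more often.
    """
    total = sum(pheromone)
    shares = [level / total for level in pheromone]
    return [
        (1.0 - evaporation) * level + share / length
        for level, share, length in zip(pheromone, shares, lengths)
    ]


def simulate(lengths, rounds, evaporation=0.1):
    """Start from equal pheromone on every path and run `rounds` updates."""
    pheromone = [1.0] * len(lengths)
    for _ in range(rounds):
        pheromone = update_pheromone(pheromone, lengths, evaporation)
    return pheromone


# Hypothetical usage: a short path (length 1) and a long path (length 2).
levels = simulate([1.0, 2.0], rounds=50)
short_share = levels[0] / sum(levels)
```

With equal starting levels the first round is an even split, matching the "initial decisions are made at random" stage; from then on the shorter path always holds more pheromone than the longer one, which is the ordering effect the slides describe.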
