
Jongjun Son, Sun Microsystems, Korea




  1. Sun's Infrastructure Solution for Grid Engine • Jongjun Son, Sun Microsystems, Korea • http://sun.com/grid

  2. Agenda • Sun's Grid Strategy • Sun's Grid Software • N1 Grid Engine 6 • Technical Overview

  3. Sun's Grid Strategy

  4. Sun's Grid Computing Approach • A flexible and scalable architecture • Pools computing resources to solve important problems • Collects unused capacity for better utilization • Architecture for seamless addition of resources • Up to hundreds or thousands of processors and systems • Multi-platform, multi-OS • Distributed resource management (DRM) • Distributed system and software management • A well-designed Grid Computing infrastructure is accessed, used, and managed as a single, unified resource

  5. Supported Platforms: N1 Grid Engine • Download and try it out free at http://gridengine.sunsource.net

  6. Compute Elements: Sun's End-to-end Product Line • Access systems • Thin clients, workstations • Compute nodes • Linux and Solaris Operating Systems • Compact 1U and 2U servers • Blade servers • Larger symmetric multiprocessing (SMP) systems • Sun Fire Superclusters • Pre-configured Grid Computing rack systems [Image: Sun Fire Compute Grid rack system]

  7. Sun Fire Compute Grid: Engineered, Tested, Integrated, Supported • Up to 32 Sun Fire V20z or up to 10 V40z • Sun Control Station • Sun N1 Grid Engine Software • Up to 2 x 24-port Gigabit Ethernet switches • 48-port Terminal Server • Keyboard/Video/Mouse shelf unit • Sun Rack 1000-38

  8. Sun's Grid Software

  9. Software Elements: Small to Large Grid Computing Solutions [Diagram labels] • Service Discovery • Global Grid Infrastructure • OGSA, Globus Toolkit • Authentication/Authorization • Avaki • Industry standards and partner technologies • Data Management • Enterprise Grid Infrastructure • N1 Grid Engine • Policy Management • N1 Grid Engine, Solaris™ Resource Manager • Resource Management • Sun Management Center, Sun Control Station • System Management • Cluster Grid Infrastructure • Sun QFS/SamFS, Solaris CacheFS • Data Access

  10. N1 Grid Engine 6: Distributed Resource Manager, Job Scheduling • Policy management • Owners negotiate usage • 4 different, customizable policy schemes • Exceptions for specific needs • Benefits • Equitable, enforceable sharing between groups • Alignment of resources with business goals

  11. Sun Cluster Grid Manager: Unified Remote System and Grid Management • Sun Control Station software • System health and performance monitoring • Pull, push, and automatic provisioning • Deploy both Linux and Solaris x86 images • Integrated grid management module • Manages Sun Grid Engine or Sun Grid Engine, Enterprise Edition • Aggregated management • Address hundreds of systems individually or in groups • Combined system, software, and grid management

  12. Sun Cluster Grid Manager

  13. Grid Engine Portal

  14. A Complete Solution: Proven and Repeatable Reference Architectures [Diagram labels] • Control Network (Gigabit Ethernet) • Sun Compute Grid rack systems • Sun ONE Grid Engine • Servers • Workstations • Sun Cluster Grid Manager • Data Network (Gigabit Ethernet) • Sun StorEdge storage solutions (Direct-attached, NAS, HA-NFS, HPTC SAN)

  15. Grid Scalability from Local to Global: Cluster, Enterprise, and Global Grids [Diagram labels: Cluster Grid, Enterprise Grid, Global Grid, Internet]

  16. N1 Grid Engine 6 Technical Overview

  17. Agenda • N1 Grid Engine Overview • Architecture • Resource, data access • Application Integration • N1GE 6 New Features • Accounting & Reporting

  18. N1 Grid Engine Overview: Resource Management • Selection of jobs • Simple policies: FIFO, equal share, rank • Sophisticated policies: sharing, urgency, priority, deadline, resource-based, etc. • Selection of resources • System characteristics: CPU, memory, OS, patches, etc. • Status of systems: available memory, load, free disk space, etc. • Status of other resources: licenses, shared storage, other software, etc. [Figure: Grid Engine job script for BLAST: blastall -p blastn -i /nfs/data]

  19. N1 Grid Engine Overview: Resource Control • Control of jobs • Suspend, resume, kill, migrate, restart • Customizable action methods • Manual or automated via policies • Control of resources • Regulate load from Grid jobs based upon resource value thresholds • Control access via permissions, time/date, job type • Allocate systems to jobs based on total resource consumption (e.g., memory, CPUs, disk)

  20. N1 Grid Engine Overview: Resource Accounting • Accounting of jobs • Current resource consumption always monitored • Total detailed consumption recorded at end of job • Includes record of user, department, project, etc. • Accounting of resources • Current usage of resources on hosts always monitored • Information recorded over time: resource utilization of hosts and grid; grid configuration changes

  21. Grid Engine 6 Architecture • Access Tier: Submit Host, Admin Host • Management Tier: Master Host (Qmaster, Schedd), Shadow Host? • Compute Tier: Exec Host (execd) • SGE daemons communicate over TCP/IP

  22. Resources: The Heart of Grid Engine Management • Built-in and custom resources • Static resources: strings, numbers, boolean • Countable resources: e.g., licenses, MB of memory/disk • Measured resources: value provided through Load Sensor • Resources used for • Job resource request: job A needs 1 license and 1 GB • Load/suspend thresholds: suspend jobs if load_avg > 1.5 • Load formulas: send jobs to hosts with least load; out of those, choose hosts with most free memory • Per host: load_avg, mem_free, OS/patch-level • Global: floating licenses, shared storage
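
The threshold and load-formula bullets on this slide can be illustrated with a minimal Python sketch. It is not Grid Engine code: the `Host` class, the `pick_host` and `should_suspend` helpers, and the sample values are hypothetical, and only the `load_avg > 1.5` threshold and the "least load, then most free memory" rule come from the slide.

```python
from dataclasses import dataclass

# Threshold taken from the slide's example: suspend jobs if load_avg > 1.5.
SUSPEND_LOAD_THRESHOLD = 1.5


@dataclass
class Host:
    """Hypothetical per-host resource snapshot (names are illustrative)."""
    name: str
    load_avg: float
    mem_free_mb: int


def pick_host(hosts):
    """Prefer the least loaded host; break ties by most free memory."""
    return min(hosts, key=lambda h: (h.load_avg, -h.mem_free_mb))


def should_suspend(host, threshold=SUSPEND_LOAD_THRESHOLD):
    """Return True when the host's load exceeds the suspend threshold."""
    return host.load_avg > threshold


# Made-up sample data for illustration only.
hosts = [
    Host("h1", load_avg=0.4, mem_free_mb=2048),
    Host("h2", load_avg=0.4, mem_free_mb=8192),
    Host("h3", load_avg=1.9, mem_free_mb=16384),
]
chosen = pick_host(hosts)
```

With this sample data, `h1` and `h2` tie on load, so the host with more free memory (`h2`) is chosen, while `h3` exceeds the suspend threshold.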

  23. Parallel and Checkpointing Environments • Environment: a set of hosts used to support parallel or checkpointing applications • Applications must inherently support parallel/checkpointing execution [Diagram: hosts H1 to H7]

  24. Data Access [Diagram labels] • App binaries and job data: configured independently • Exec hosts • Data Grid • File staging • NFS sharing

  25. Application Integration Methods • General methods: queue/host prolog, starter method, suspend method, resume method, terminate method, queue/host epilog • Parallel methods: parallel start, parallel stop • Checkpointing methods: checkpoint command, migration command, clean command, requeue job • [Flow diagram from START to END; other labels: Job, run at specified intervals]

  26. Integrating applications with Grid Engine (two most common approaches listed first) • Unmodified/legacy application binaries: integrate using wrapper script • Interactive applications: use pluggable remote mechanisms, e.g., ssh, rsh, telnet • Grid-ready applications: modify code to use DRM APIs (API recently standardized) • Java applications: JGrid package for low-level coupling (object/method distribution); currently provided separately
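
The wrapper-script approach for unmodified binaries can be sketched as follows. This is a generic Python illustration, not a Grid Engine interface: `run_wrapped`, the `APP_HOME` variable, and the demo command are hypothetical. The idea is only that the wrapper prepares the environment, runs the unchanged application, and passes its exit status back to whatever launched the wrapper.

```python
import os
import subprocess
import sys


def run_wrapped(command, extra_env=None, workdir=None):
    """Run an unmodified application and return its exit status.

    command   -- argument list for the application (hypothetical here)
    extra_env -- environment variables the application expects
    workdir   -- directory to run in, or None for the current one
    """
    env = dict(os.environ)        # start from the inherited environment
    env.update(extra_env or {})   # add application-specific settings
    result = subprocess.run(command, env=env, cwd=workdir)
    return result.returncode      # propagate the application's status


# Demo: the "legacy application" is just a Python one-liner.
demo_status = run_wrapped(
    [sys.executable, "-c", "print('legacy app ran')"],
    extra_env={"APP_HOME": "/opt/app"},  # hypothetical variable and path
)
```

Returning the application's exit status unchanged matters because the scheduler can only record what the wrapper reports.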

  27. N1GE 6 New Features: Architecture • Berkeley DB spooling • Multi-threaded master daemon • New communication system • Scalability goals for N1GE 6, per master • Up to 10,000 unique hosts • Up to 500,000 unique jobs (array jobs counted as a single job)

  28. N1GE 6 Supported Platforms (end of CY2004)

  29. N1GE 6 New Features: Scheduler Functionality • Advanced planning capabilities • Resource reservation with backfilling • Can reserve any resource, e.g., memory, CPU, license • More sophisticated scheduling algorithms • Management policies matched with business priorities • Priority, urgency, share tree, category, deadline, etc.

  30. Job Resource Reservation

  31. Simple, priority-based scheduling [Chart: Jobs 1 to 6 plotted over time against Lic., Mem., and CPU for Global, Host 1, and Host 2; includes a region labeled "Wasted resources"]

  32. Scheduling with Resource Reservation [Chart: Jobs 1 to 6 plotted over time against Lic., Mem., and CPU for Global, Host 1, and Host 2]

  33. Resource Reservation with backfilling [Chart: Jobs 1 to 6 plotted over time against Lic., Mem., and CPU for Global, Host 1, and Host 2]
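
The idea behind reservation with backfilling can be shown with a toy Python sketch. It is not the N1GE 6 scheduler: it assumes a single countable resource (slots), integer time steps, and hypothetical job data. Jobs are placed in priority order, each placement acts as a reservation, and a lower-priority job may start earlier only if it fits without pushing total usage over capacity at any point of its run.

```python
from dataclasses import dataclass


@dataclass
class Job:
    """Hypothetical job description for this sketch."""
    name: str
    priority: int   # higher value is placed first
    slots: int      # amount of the single resource needed
    duration: int   # run time in abstract time units


def usage_at(placed, t):
    """Total slots in use at time t by already placed jobs."""
    return sum(slots for start, end, slots in placed if start <= t < end)


def fits(placed, capacity, start, job):
    """Check capacity at every point where usage can rise during the run."""
    end = start + job.duration
    checkpoints = [start] + [s for s, _, _ in placed if start < s < end]
    return all(usage_at(placed, t) + job.slots <= capacity for t in checkpoints)


def schedule(jobs, capacity):
    """Place jobs by priority; earlier placements are kept as reservations."""
    placed, starts = [], {}
    for job in sorted(jobs, key=lambda j: -j.priority):
        if job.slots > capacity:
            raise ValueError(f"{job.name} can never fit")
        # Earliest feasible start is time 0 or the end of some placed job.
        candidates = sorted({0} | {end for _, end, _ in placed})
        start = next(t for t in candidates if fits(placed, capacity, t, job))
        placed.append((start, start + job.duration, job.slots))
        starts[job.name] = start
    return starts


# Made-up workload: 4 slots in total.
jobs = [
    Job("running", priority=4, slots=3, duration=4),
    Job("big", priority=3, slots=4, duration=2),
    Job("small", priority=2, slots=1, duration=3),
    Job("small2", priority=1, slots=1, duration=5),
]
starts = schedule(jobs, capacity=4)
```

In this workload, `big` receives a reservation starting at time 4, `small` is backfilled at time 0 because it finishes before that reservation begins, and `small2` has to wait until time 6 because starting earlier would overlap the reservation.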

  34. Resource Management Policies: Resource allocation based upon business priorities • Policy basis includes: cumulative utilization, category priority, time-based priority, resource value, etc. • Powerful, flexible, tunable, easy to configure [Diagram: All jobs split into High Priority, Normal Priority, and Low Priority; labels: Dept A: 70 and Dept B: 30 (more rights to high priority jobs); Dept A: 50 and Dept B: 50 (shown twice); Group X: temporary boost]

  35. Policies for Job Prioritization • Priority determines which pending jobs get dispatched • Job priority is calculated from three sub-policies, each normalized to $0.0 < N < 1.0$: $\mathrm{prio} = W_{urg} N_{urg} + W_{tix} N_{tix} + W_{psx} N_{psx}$ • $N_{urg}$ = normalized urgency • $N_{tix}$ = normalized tickets • $N_{psx}$ = normalized POSIX priority • $W$ = weighting factors
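
The weighted sum on this slide can be worked through in a few lines of Python. The function follows the slide's formula directly; the weights and the normalized values for the three pending jobs are made-up numbers chosen only to show how the ordering falls out.

```python
def job_priority(n_urg, n_tix, n_psx, w_urg, w_tix, w_psx):
    """prio = W_urg * N_urg + W_tix * N_tix + W_psx * N_psx."""
    return w_urg * n_urg + w_tix * n_tix + w_psx * n_psx


# Hypothetical weighting factors (not Grid Engine defaults).
WEIGHTS = {"w_urg": 0.5, "w_tix": 0.3, "w_psx": 0.2}

# Hypothetical normalized (urgency, tickets, POSIX) values per pending job.
pending = {
    "job_a": (0.9, 0.2, 0.5),
    "job_b": (0.1, 0.8, 0.6),
    "job_c": (0.5, 0.5, 0.5),
}

priorities = {name: job_priority(*n, **WEIGHTS) for name, n in pending.items()}

# Highest priority is dispatched first.
dispatch_order = sorted(priorities, key=priorities.get, reverse=True)
```

Because urgency carries the largest weight in this example, `job_a` is ordered ahead of `job_c` and `job_b` even though `job_b` has the most tickets.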

  36. 5.x Queue vs. 6.x Cluster Queue [Diagram comparing the 5.x Queue with the 6.x Cluster Queue across hosts A, B, C, D; label: Cluster Queue Hosts]

  37. N1GE 6 New Features: Analysis / Monitoring / Accounting • Value-add module for analysis, monitoring, accounting reports, etc. • Fine-grained resource recording • Stored in an RDBMS in a well-defined schema • Provides built-in capability for reporting, chargeback, etc. • Web-based console tool provided for generating reports, queries, etc.

  38. Why a second, separate DB? • Different access considerations • Standardized access (SQL, ODBC, JDBC) • More powerful database structure • Independent of core system data • Historical data • Derived data (sums, averages ...) • Queries won't affect system performance • Lower requirements on availability

  39. Architecture • Data flow: Qmaster -> Reporting File -> Reporting-Writer -> Reporting-DB (raw data; build derived values) • Reporting-Writer: Java application • Loosely coupled to the SGE system via the qmaster-generated reporting file • Stores raw data and pre-processed data in the SQL DB via JDBC
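
The loose coupling described here (a file in, SQL rows out) can be sketched in Python. This is an illustration only: the slide's Reporting-Writer is a Java application using JDBC, while this sketch swaps in Python's `sqlite3`. The colon-separated record layout, the table names, and the sample records are all hypothetical and do not reproduce the real reporting file format or schema.

```python
import sqlite3

# Hypothetical record layout: job:<id>:<owner>:<project>:<exit_status>:<cpu_seconds>
SAMPLE_REPORT = """\
job:1001:alice:proj_x:0:12.5
job:1002:bob:proj_y:1:3.0
job:1003:alice:proj_x:0:7.5
"""


def load_report(conn, text):
    """Store raw job records, then rebuild a table of derived values."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS job_raw ("
        "job_id INTEGER, owner TEXT, project TEXT, "
        "exit_status INTEGER, cpu_seconds REAL)"
    )
    rows = []
    for line in text.splitlines():
        fields = line.strip().split(":")
        if len(fields) != 6 or fields[0] != "job":
            continue  # skip anything that is not a well-formed job record
        _, job_id, owner, project, exit_status, cpu = fields
        rows.append((int(job_id), owner, project, int(exit_status), float(cpu)))
    conn.executemany("INSERT INTO job_raw VALUES (?, ?, ?, ?, ?)", rows)

    # Derived values: job count and CPU time per owner.
    conn.execute("DROP TABLE IF EXISTS job_cpu_by_owner")
    conn.execute(
        "CREATE TABLE job_cpu_by_owner AS "
        "SELECT owner, COUNT(*) AS jobs, SUM(cpu_seconds) AS cpu_seconds "
        "FROM job_raw GROUP BY owner"
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
load_report(conn, SAMPLE_REPORT)
summary = conn.execute(
    "SELECT owner, jobs, cpu_seconds FROM job_cpu_by_owner ORDER BY owner"
).fetchall()
```

Keeping the derived table separate from the raw rows mirrors the slide's split between raw data and pre-processed data, so report queries can read precomputed sums instead of scanning every job record.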

  40. Stored Data • Job-related information: times, user, project, exit status ... • Host- and queue-related information: load information, consumables ... • Sharetree: configured shares, actual shares ... • Precomputed, derived values: sums, averages per host, queue, user, project ...

  41. ARCo: Accounting and Reporting Console • Web-based tool for displaying data in the reporting DB • Based on Sun Web Console • Ability to create simple and advanced (SQL-based) queries • Generates tables and graphs, exportable as CSV or PDF • Also supports command-line report generation

  42. Selecting a query

  43. Query Results

  44. Defining new query

  45. jongjun.son@sun.com http://sun.com/grid
